Friday, April 14, 2017

lrs services

good afternoon. come on, good afternoon. >> good afternoon. >> thank you and thank you all for coming today and spending your afternoon, i really appreciate it. i'm adam glick, i'm a senior program manager focusing on resiliency in azure. and we're gonna spend today, or at least the next 75 minutes of today, talking about designing your applications in azure for

high availability and scalability. making sure that you can have your apps be online, keep your uptime as high as possible within it. i put my contact information on this slide, that's not required. i do that because i wanna hear from you. so there's a slide at the end where i'll talk about filling in your evaluations, i hope you do. but also, if you have any questions to follow up afterwards, or if something comes up, you can reach out to me.

whether it's on github, linkedin, my email at microsoft, or on twitter. i will tell you that twitter is probably the one that i check least. but use whichever one you like to get a hold of me, because i certainly want to hear what you like, what you like to hear more of, what you like to see us doing, so that we can make better content for all of you. that's what i do. with that, let's get started. three major things we're gonna go over.

designing your applications for high availability, using the azure platform, understanding best practices, what you should be doing. and also, understanding what are some of the things you might not be aware of that are pitfalls, and how to avoid those. to make sure that you can do things the right way and not get bitten by some of the gotchas that sometimes people run into the first time they design something in the cloud, versus on-premises, where they may be more aware of them.

and then talk a little bit about some of the inner workings of azure, what's actually happening with some of the services, and how we've designed some of these things for high availability. one note is that i spend most of the time in this session talking about iaas, infrastructure as a service, and building that out, as opposed to paas services. paas services, for the most part, take care of the high availability themselves. there are some things you can do there but

if you have questions about that, ask me afterwards. but i'm really gonna spend most of my time talking about doing infrastructure as a service, how you do that with high availability in azure. with that, let's get going. high availability, what does it mean to you? everyone actually has a different definition of this. i mean, if you think about uptime, ask yourself, what's the availability of the application that you wanna do, and

that's a business requirement. i won't drill in too much into how you define that because it's gonna be different for every person in this room. but basically, remember that you're gonna define what that is, and that's gonna define how complex a system you wanna build. how many nines you wanna be designed for, what level of fault you wanna be able to handle with your system. and so think about what those different things are, because it's gonna alter what architecture you're gonna use.
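the "how many nines" question maps directly to a downtime budget per year. a minimal sketch of that arithmetic (plain reliability math, nothing azure-specific):

```python
# downtime budget per year implied by an availability target
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """minutes of allowed downtime per year for a target like 0.999."""
    return MINUTES_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999):
    print(target, round(downtime_minutes_per_year(target), 1))
```

so two nines allows roughly 5,256 minutes (about 3.7 days) of downtime a year, while four nines allows under an hour; that gap is what drives how complex an architecture you need.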

and we'll talk about a number of different architectures today and how you'll use those different architectures to solve for different levels of faults that you might have happen within an organization. this is part of the resiliency continuum. so some of this is kind of foundational, i just wanna make sure we're all on the same page about stuff. so the resiliency continuum is basically everything that has to do with keeping your applications online.

this talk is all about high availability; i'm not gonna talk about backup or disaster recovery. there are some links at the end of this deck that point to sessions that cover that content, other talks here at ignite that you can check out. but the thing that i want you to take away is, basically, rpo and rto. that's recovery point objective and recovery time objective. for

high availability, the target is 0. so the goal is, basically, you are always online and any fault that happens does not take your system offline. and that's what we're gonna talk about designing systems for today. if you're in a situation where those have gone offline, you're in a disaster recovery scenario, and you can check out the talk that was done yesterday that covers those pieces. so think about how you make sure that as things happen,

as servers go down, as disks fail, as you get high levels of traffic, you can make sure that your services continue to serve up the applications that you've built. so when you think about the cloud and you think about doing that, you need to start thinking about something that, for anyone who's worked in manufacturing, will be familiar. anyone worked for a manufacturing company? quick check, few folks in the crowd. you're probably very familiar with mtbf, mean time between failure, but

that's what you need to start thinking about. because you're gonna have a lot of different servers, and as the servers run, you're gonna think about, how long does it take between when these things fail? and the reason you wanna think about that is, the higher that number is, the bigger the amount of time between when things fail, the less likely that failure is to occur. and you need to think about that when you're designing for

your availability. the more rare that occurrence is, the more you have to decide, is it worth the cost of designing a system that's gonna protect you from that? the more common that failure is, the more you're gonna wanna be able to build a system that's able to handle it. something like a disk failure or, say, a dropped packet in a network might be fairly common, so you wanna make sure you design for that. something like a data center burning to the ground, hopefully, is a very uncommon thing that happens for you.
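that trade-off between failure frequency and repair can be put in standard reliability terms: availability is mtbf over mtbf plus mttr (mean time to repair). a small sketch, with made-up numbers:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """steady-state availability: fraction of time a component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# illustrative numbers: a part that fails every ~1,000 hours and takes
# an hour to repair, vs. a much rarer failure with the same repair time
common_failure = availability(1_000, 1)    # ~0.999
rare_failure = availability(100_000, 1)    # ~0.99999
print(common_failure, rare_failure)
```

the rarer the failure, the closer you get to always-on for free; the more common it is, the more the repair path (or redundancy) dominates your design.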

and so, you have to decide, do you wanna be able to design something that spreads across regions, multiple data centers? we'll talk about all those different designs and how you do it. but you'll make those decisions because they'll impact the cost and the design of your architecture: how complex it is, how many different things you need to manage and set up automation for. this is true for on-premises and in the cloud; obviously, we'll be talking about the cloud here.

but a lot of the principles and the design techniques we talk about will be the same, although the technology that i'll be showing you is how azure does those things, and makes it easy for you. and just remember, failure of hardware is inevitable. when we say mtbf, that's mean time between failure, because everything fails at some point. never assume that something will always be available. something will fail, always.
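"something will fail, always" is also why redundancy pays off so quickly. assuming independent failures, the standard math looks like this:

```python
from functools import reduce

def serial(parts):
    """a chain of single points of failure: every part must be up."""
    return reduce(lambda a, b: a * b, parts)

def redundant(a: float, n: int) -> float:
    """n identical instances behind a load balancer: up if any one is up."""
    return 1 - (1 - a) ** n

# three tiers, each one 99.5% machine: the chain is weaker than any link
print(serial([0.995, 0.995, 0.995]))  # ~0.985
# two 99.5% machines in an availability set: far better than one
print(redundant(0.995, 2))            # ~0.999975
```

the independence assumption is exactly what fault domains are for: machines that share a rack or power supply don't fail independently.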

it's just a question of when, so plan for it. so some good design principles, when you design for the cloud and you think about, okay, what are the things you should think about architecturally, and then doing it. we'll step through a bunch of these and then we'll drill in deeper to talk about how to actually make them happen. but if you take nothing else away from this talk: no single point of failure.

so i run into this one all the time when i'm talking to people, and everyone conceptually gets that, right? we've been doing this for years on-premises: if you've got one machine, you always have two machines, and you link them together to make sure you have no single point of failure. remember that it's not just the virtual machines you need to think of. it's everything in the infrastructure that you're building.

we'll talk later on about how to make sure you don't have single points of failure in storage, in what you're doing in networking, and in load balancing, so that you have no single points of failure. it's no single points of failure throughout your entire stack. and again, when you talk about that blast radius, of how big an event you wanna protect yourself from, you also need to think about what that means. in terms of, is a region a single point of failure for you?

do you wanna be able to survive something like a region service disruption, cuz that's one single point? and if you wanna do that, there are some design characteristics that we'll talk about, how you build those systems. but again, different systems, different size and scale that you wanna be able to design for. the next one is: your state should stay at the edges of your app stack. when i say app stack, it's important to think that it's not just the edge.

you don't want that always on the client, but you think about, basically, the ends: your presentation layer, your client, or the database layer. don't put it in the middle; you want stateless services. so make things as stateless as possible. the best place, usually, is to store your state in a database. if you've got something that has a lot of transactions moving fairly quickly, you may decide to put your state in something like a nosql database,

versus storing your transaction data or your financial data in something like an rdbms, a sql database. note that you don't have to pick just one database. you pick the right thing, but make sure you store that state in one particular location. don't store it throughout the app. if you create stateful applications, you're creating things that are brittle. you're creating things that,

when those pieces of the application fail, your customers will be impacted. so keep your state away from the center of it. it will allow you to be more scalable, it'll make it faster; we'll talk about that. next, loosely couple your components; if at all possible, make sure they're loosely coupled. what i mean by that is, if you're going in and you're passing state along through, so you've got a monolithic three tier app.

everyone's built one at some point, three tier app, yes? yes, hopefully. we'll talk a lot about those. when you have a three tiered monolithic app and you're storing state, like you store state on the server, then if that goes down, you're impacted. it also means that if something else in that stack goes down, if things are reliant upon that, you've impacted everything in that chain. if you make things loosely coupled, then losing one part

doesn't take everything else down with it. so this is a little bit more of a service oriented architecture or microservices; we'll talk about both today. but if you think about it, if you take microservices, then instead of passing things between load balancers, which is typically what you'll do between tiers to make sure you have scalability, you're gonna use things like queues. because if we put it in a queue and that tier isn't available for

some reason, it'll just stay in the queue. and when that tier becomes available, it can pick it up. you're resilient to that failure that happened, versus if it goes to a load balancer and there is nothing below that load balancer. if that tier has failed for some reason, then your service is offline, and it didn't matter that your other tiers were available. because you've lost it, your customer packet just goes to nowhere and dies. so think about using queues instead of load balancers.
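the queue-versus-load-balancer point can be sketched with python's in-process queue standing in for a real service such as azure storage queues or service bus (the message shape here is made up):

```python
import queue

work = queue.Queue()

# the front tier keeps producing even while the tier below is down
for order_id in range(3):
    work.put({"order": order_id})

# messages wait in the queue instead of dying at a load balancer
# that has no healthy back-ends behind it
assert work.qsize() == 3

# when the tier comes back, it drains the backlog in order
processed = [work.get()["order"] for _ in range(work.qsize())]
print(processed)  # [0, 1, 2]
```

the fault window of the consuming tier becomes latency instead of lost requests, which is the loose coupling being described.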

it does require a different bit of architecture, and you have to choose what's right for you. but that will help you build things that are loosely coupled versus strongly coupled. next: process your content centrally and then deliver it locally. think about your data. if you distribute your database everywhere, you're gonna have all these bits that you have to piece together. it's super easy with the cloud to put it in one place, process that, and then send the data to the places where you wanna deliver it.

so if you have all your data in one central place, you're gonna avoid some of the challenges that you have with databases when spreading them out. we'll talk later about how, if you need to spread them out for any number of reasons, including disaster recovery, you do that. but that's gonna make it more complex. again, everything's on a continuum of the complexity and the cost versus your ability to move quickly and easily. so you process it centrally but deliver it locally, and

whether that be making sure that you process it in a data center and then, say, putting it out in a storage file that you pull it out of, or having a secondary database, where you've actually made that data available in locations that are closer to your customers. or whether it be using something like a cdn and actually having the cdn cache that content closer to your customers. they'll get a better experience out of it, you'll get more scalability out of it.

and we'll talk a little more about that later in the presentation. the next one is a good rule of thumb about automating: make sure you automate everything. i know, especially when you're doing development, it's super easy to go in there, make a quick configuration change, or just develop and push it. it's a bad habit. it's fast, it's easy, it's not repeatable.

if you take a look at failures that happen with systems, even data center level failures, industry-wide statistics show over 50% of them are human error. people make mistakes. we should manage code; code should manage machines. get in that habit. you'll be able to be more scalable, you'll be more available, you'll be able to bring your systems up faster.
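"code should manage machines" is the idea behind declarative tools like arm templates: you declare the desired state and a reconcile step makes reality match. a toy sketch of that step (role names are invented):

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """bring actual instance counts in line with the declared state."""
    result = dict(actual)
    for role, count in desired.items():
        result[role] = count              # scale each declared role to match
    for role in list(result):
        if role not in desired:
            del result[role]              # drop anything undeclared
    return result

desired = {"web": 3, "logic": 2}          # the template: desired state as data
actual = {"web": 1, "old-tier": 4}        # drift from manual, by-hand changes
print(reconcile(desired, actual))         # {'web': 3, 'logic': 2}
```

the point is repeatability: running the same declaration twice gives the same result, which a sequence of by-hand changes never guarantees.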

you can do that, and we'll talk a little bit about how you can do that in azure, using arm templates to do it. me personally, i'm a fairly trusting guy, but stick with trust and verify. everything you do, make sure it's actually happening. we'll talk about how you monitor your services a little bit. we'll talk about what you're doing with load balancing to make sure things are up. you have to make sure that not only have you built the system that you

intended to build, but that you're verifying that that is happening and that it's staying that way. and there are a number of different services that you'll verify in different ways, and we'll talk through each of those, about how you do that and make sure that you're doing it. but you have to verify not only that your system is there, but that it's delivering what you want it to do. and the last one is really one of these big changes that's happened as

you move to the cloud, versus building things on-premises. and that is: think about scaling out versus scaling up. so let's drill into that a little bit. when you think about it traditionally, on-premises, how many come from an on-premises background? almost everybody? cool. so traditionally we've scaled things up. the easy example of this is the database server. you've got the database and

you bought the biggest database server you could. and you bought the biggest database server you could because you can't scale the database server out. technically you can, you can go and you can shard it, but that's a whole lot of work that you probably don't want to do unless you hit the point where you have to. so you bought a really big box, and it was probably highly underutilized, and you made sure that you put the best drives you could in it,

that you got the best network cards. you put redundant power supplies in it. you did everything, cuz you needed to take care of that server, and that was scaling up. and if you hit the point where all of a sudden you reached the limits of that server, you probably went out and you bought an even bigger server. that's scaling up. you can do that in the cloud if you want.

we make it easy in azure: if you want to go with an instance family and just scale up, move something into a larger instance size. but the problem with that is, one, your cost, cuz that cost isn't a linear scale. if you spin up a second server, a second server is the same; you just double the cost of one. but if you scale up to a server that's twice the size, a twice-the-size server is not twice the cost. it's much more than that.

so you need to think about what's the cost impact of what you're doing. when you scale horizontally, you don't have that impact. the other side of it is, frankly, that it's brittle. when you scale up, yes, you may be able to handle the load, so you've handled the scalability issue that you want. but you still haven't handled the high availability piece of it. you still have one server. and when that one server has something happen to it,

remember, hardware will fail, bad things will happen, and when that goes down you don't have another one there. or if you do, that's extremely costly. if you have it spread out, if you have multiple servers, you'll avoid that. and we make that easy with things like vmss. vmss stands for vm scale sets; we'll talk about that a little bit later. but that's basically an easy way for you to define what a tier of your

application is, and how you stamp it out, and how you grow those things horizontally, and how you autoscale them. but the best description i've heard of this is the description of pets versus cattle. so when you think about things that scale vertically, those servers that you know, think of them like a pet. you probably know the name of them. i'm willing to bet, if you ran any on-premises infrastructure, that you actually knew the name of the server that you would go and

you would connect to. you took care of that server. if there were problems with the server, you would log into that server. you would check, hey, what's causing my cpu, what's causing my ram usage or my disk to be highly utilized? and you'd be going in there and you'd be fixing that server, cuz that server is something that you needed to take care of. and that's the vertical scaling view that you take on things.

that's very different than a scale out version. the scale out version is like cattle. cattle you don't give names to. cattle are basically a commodity. if you need more cattle, you buy more cattle. if you need less cattle, you make some hamburgers. [laugh] it's super easy for you to go back and forth, and the cloud enables that. what it basically does is remove that capacity concern, and

that i need to have this machine and i need to keep it running. the easy example for this is when you think about patching. so in the world of pets, in the world of on-premises servers, you've got a new patch, you've got your patch that comes out each month. it's just as true for linux as it is for windows. you've got your patches and you're gonna go patch that server. what do you do? you're running some sort of management suite, and you're going and you're patching that server. you're logging in, you're doing it manually.

you're going in and you're patching that server all the time. you're taking care of it; it's a pet. in the world of cloud, you don't need to do that. you've got a new server. there's a new version coming out. you've already got these things templatized. you just spin up a new server, whatever it is that you want. it's already patched. it's already got the latest version.
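that replace-instead-of-patch flow can be sketched as a rolling swap over a pool (the image names are invented; a real setup would drive this through scale set or template tooling):

```python
def rolling_replace(pool, new_image):
    """swap each instance for a fresh one built from an already-patched image."""
    for i in range(len(pool)):
        fresh = {"image": new_image, "healthy": True}  # spin up from template
        pool[i] = fresh                                # swap into the pool
        # the old instance is simply discarded: cattle, not pets
    return pool

pool = [{"image": "web-2017-03", "healthy": True} for _ in range(2)]
rolling_replace(pool, "web-2017-04")
print(sorted({vm["image"] for vm in pool}))  # ['web-2017-04']
```

nothing in the pool is ever patched in place; every instance is replaced from a known-good, already-patched template.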

if you go into azure and you select, say, windows server 2016, you launch it, and it is the latest patched version. and then you install whatever pieces you want. you connect it in your infrastructure. you take your old one and you just throw it away; you kill it. they're cattle, they're commodities. and so it allows you to think about how you manage and scale your infrastructure. that's the way you scale out your infrastructure, and

that's the way you can manage your infrastructure. and that's why horizontal scaling makes it really easy, and a much better way to manage things at scale as you work within the cloud. so that said, that means that there are certain workloads that work really well in the cloud, and there are certain ones that you want to think a little bit more about before you do. so things that work really well in azure, things that work well in the cloud, are things that you built. because when you go and you build a new application,

you can follow the principles that we're gonna talk about in this talk, and you can build it so that it's built for the cloud, that you're using the right principles to make sure that it's scalable, that you're taking advantage of what the cloud provides to you. if you're using software that someone else built, it may or may not be designed for the cloud. if it is, then you're fine. but think about legacy applications: how many people run an application in an organization that is at least, say, six years old?

so those are going to be a little bit more challenging, because they probably weren't designed for the cloud. they were probably designed for vertical scaling within an organization. now, you can lift and shift those to the cloud, but what you're gonna do to maintain high availability with those is gonna be different. you're gonna have to think strongly about the database layer, when we talk about databases, and how you make sure that data availability is there.

but there may not be a lot that you can do for, say, the front end or the logic tiers. if they've been designed to be stateful applications, they are somewhat brittle, and there's not a lot you can do about it, because you don't own the code for that. so you're gonna have to think about that as you design, and as you think about what availability targets you can hit with those older applications if you lift and shift them to the cloud. so the last one is designing around a single-state datastore, and

that's what we talked about before, about making sure that you're storing state and you're keeping it in a data layer, so that you can keep your applications as stateless as possible, because that's gonna give you the most flexibility and ability to scale and handle faults as they happen. so in today's talk, we're gonna go through three basic areas: compute, networking, storage. these are three kind of core areas for your infrastructure, and these

are three different areas where you can have faults that come in. you can have failures that happen; how are you gonna design and protect around each of them? you could have a fault happen at any one of them. or, due to my awesome powerpoint animation skills, you can have faults on all of them. so let's talk through what you can do. let's start with compute. so, first thing, when you're thinking about your compute,

i'll ask a dangerous question here: how many of you have all of your vms running in an availability set? okay. if i can get everyone [laugh] in this room to move your vms into an availability set, you will do yourself a great service. so we'll talk a little bit about what that does for you, but the important thing that you need to know is you need to create the availability set; it needs to exist when you create your machine. if you have an existing machine right now and

it's not in an availability set, you can't move it into one. you'll have to re-create the machine in the availability set. every time you create a machine, you should create it in an availability set, even if you're only creating one machine. this will allow you to plan for the future, to plan for future scalability, and to protect yourself for availability. because if you want the azure sla, which is on vms, you need to have at least two machines in an availability set. so it's important that you have those there.

even if you're going to run one machine, and you decide that you're not gonna put yourself up for the sla, that's okay. but then realize that your ability to scale in the future means that you need to have that availability set already created, so it's already there when you go in and create a vm. it's there in the portal. you can just choose availability set, create a new one, or if you're doing it at the command prompt or in an arm template,

you can just create it there. but you have to make sure that you do it. put everything in an availability set. just make that your new normal. and the thing to remember is, don't put them all in the same availability set; tie it to a role in your application. so for instance, all of your web servers are part of an application. basically, everything that could be a carbon copy of the other ones, you want in one availability set.

so all of your web servers would go in one availability set for the application. all of your business logic would go in an availability set. so you'll do separate availability sets. you won't put them in the same one, because with an availability set, you want to make sure that's how it scales things out. that's how we know that we're protecting your application from the various layers of faults that can happen. so the big thing it does is it spreads you out across

fault domains and upgrade domains. so these are two things that you need to be aware of in azure, cuz these will help increase your availability. so here's a chart. i put these both up here because some of you may be running in the classic portal, that's asm, and some of you may be running in arm. that's the current portal, portal.azure.com. so if you go in there, the numbers are slightly different

between the two, and that's why i put both of them up there. but the thing to understand is the difference between fault domains and upgrade domains. so you can think of a fault domain as a dedicated set of hardware. you can think of it as a rack or a cluster of racks, but it's a set of hardware. and when you spread yourself across fault domains, that means that you've eliminated the ability for a failure at that level to actually impact your application.

if the rack went down, or if the cluster went down, and you spread across fault domains, that means that one vm might be impacted. but if you've got an availability set, that availability set spreads you across multiple fault domains, so your other machines will be okay. if you didn't use an availability set, you could all end up in the same fault domain, which means, if a cluster went down for some reason or a rack went down, you could lose all of your vms. so this is a way of making sure that an unplanned failure

at the cluster level will not impact all of your machines, that you can separate them out. that's what fault domains do for you. update domains are about planned updates. we do patching every month, sometimes more frequently if there's a hot fix. so, we're doing updates. now, just like you update your machines, your machines run on our hosts.

our hosts need to be updated as well. we need to protect the infrastructure that all of you run your applications in. and so, when we go and we do those updates, that can impact the machine. now, we've done some really cool stuff recently in order to make those updates as fast as possible, sometimes as little as 30 seconds. it can actually happen faster than the timeout on a tcp packet. but realize those updates will happen.

there'll be updates that happen underneath it, and so, to make sure that your applications aren't impacted, you want to be in an availability set, and to have spread yourself across update domains. it's simply a slider in the portal; you can just choose how many update domains you want to have available. that will mean that we do each of those one at a time. so you can choose what percentage of your application might be impacted; you can limit the scope of what could happen to you.
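the impact cap that domains buy you is just round-robin arithmetic. a sketch that mirrors the placement behavior described (this is illustrative math, not an actual azure api):

```python
def spread(vm_count: int, domain_count: int):
    """round-robin vms across domains, the way an availability set places them."""
    domains = [[] for _ in range(domain_count)]
    for vm in range(vm_count):
        domains[vm % domain_count].append(vm)
    return domains

def max_impact(vm_count: int, domain_count: int) -> float:
    """largest fraction of vms lost if one whole domain goes down at once."""
    return max(len(d) for d in spread(vm_count, domain_count)) / vm_count

print(max_impact(20, 20))  # 0.05 -- one of 20 update domains takes out 5%
print(max_impact(20, 3))   # 0.35 -- one of 3 fault domains can take ~a third
```

more domains means a smaller blast radius per domain, which is exactly the slider trade-off being described.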

so in arm, which is what i build most of my stuff in, and i'd recommend, if you can, to start building all of your applications in arm, you get a choice of up to 20 update domains. so you can choose that a very small percentage of your infrastructure would be impacted by that, maybe about 5%. fault domains, you have three. so, if there was an unplanned failure (i'll take questions at the end, that's okay), if you had an unplanned failure happen,

it could at most impact one third of whatever's in that availability set. so, it's important to spread across both of these, and when you set up an availability set, it'll let you choose these numbers. so you can set that up and have that configured, and then when you start up your new vms, they'll just automatically be positioned in these. if you set up an availability set, your next vm, when you start it up, will automatically spread to the next one. you don't have to choose which update domain,

which fault domain they're each in. we'll do that for you. but you have to make sure you've set up an availability set in order for us to do that for you. another reason why setting up an availability set is key to making sure you get the most nines out of your availability with your applications in azure. let's talk a little bit about deployment scale. so, one vm.

every application starts here. i know all the ones that i've ever built do. you have one machine, you go, you build it. hey, it works. that's like development work. and that's great. but that's not highly available. you're back to that brittle state. you've got one single point of failure. if you take nothing else away: single point of failure, always bad.

so you put in a second machine. okay, we've got two machines. what do you do about that? are they in the same fault domain? same update domain? like in the last slide, you probably just put them in an availability set, set them up regardless of how many you're gonna have there. have that configured so you can. but once you have that, it's important to realize,

with those machines, how does your traffic know how to get there? that's putting a load balancer in front of it. so that's that three arrow design that i have above there. you're gonna think about always using a load balancer, because if you've got an availability set, load balancers have a thing called back-end pools. and with back-end pools, you can point to an availability set and basically say, hey, the traffic comes here, it can go to any of these in the back-end pool.

and i'll show you some more about that later, but realize, that's the way you handle that; it means, when you have multiple machines, making sure your traffic goes to the right place. that's pretty quick and simple. as you start to scale up from there, we'll zoom out a little bit. so now you're taking a look at your whole application. so, the person there is your user. the arrows there, that you see kinda pointing at each other,

that's traffic manager, that's dns. so, by using dns. azure dns is programmatically accessible, so you can go and you can configure that and edit it as you want, when you want. so, users will hit your dns, and dns will point them at a load balancer. you always wanna have a load balancer; even if you have one machine, which of course will be in an availability set, you wanna use a load balancer. and the reason for

that is very simple: what happens when you want your second machine? if you use the ip address that's tied to your particular vm, then when you start up the second vm, how are you gonna handle the load balancer? you want to put the load balancer in front of it, even if there's one machine, because you're designing for scale. you're designing for, how do i make sure in the future i'm always protected? so we've got a load balancer, and in this case, we set up a virtual network.
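what a back-end pool plus health probes buys you can be sketched as round-robin over healthy instances only (the instance names are invented; real probes are http or tcp checks):

```python
class BackendPool:
    """round-robin over the instances that pass their health probe."""
    def __init__(self, instances):
        self.instances = instances    # {name: is_healthy}
        self._next = 0

    def route(self):
        healthy = [name for name, ok in self.instances.items() if ok]
        if not healthy:
            raise RuntimeError("no healthy back-ends")
        target = healthy[self._next % len(healthy)]
        self._next += 1
        return target

pool = BackendPool({"web-0": True, "web-1": True})
print([pool.route() for _ in range(4)])  # alternates web-0, web-1

pool.instances["web-0"] = False          # probe marks web-0 down
print(pool.route())                      # web-1: traffic keeps flowing
```

adding a second machine is just another entry in the pool; nothing about the front door changes, which is the point of putting the load balancer in from day one.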

anytime you're designing a system, start with the network. you don't wanna get to the network later. you're gonna define what the network is. you're gonna define the sizes and roles on that: how big are your subnets going to be? don't think about just the subnets you need today. think about the subnets you need as you scale out the application. because once you set the subnets and you set the addressing space for that,

re-doing that, changing that, is a lot of work. save yourself the trouble; set it up when you start. if you use the default (cuz, if you're doing it in arm, you'll always be put in a virtual network), remember that everything you start is gonna be put in that default network. you probably don't want all of your applications put in the same network. you wanna be able to manage what those are and

how those networks talk to each other. so, start with the network layer and plan it through. be planful about what you do with your subnets, and make sure that you define them for each tier of your application, even if you're not using them today. you can add subnets later, but moving things between virtual networks, that's a much more painful process. so avoid that. in this case, we put everything in one vnet, in one subnet, that's great.
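planning the address space up front can be sketched with python's standard ipaddress module; the /16 and /24 sizes and the tier names are just examples:

```python
import ipaddress

# carve one /16 virtual network into /24 subnets, one per tier,
# leaving plenty of room to add tiers later without re-addressing
vnet = ipaddress.ip_network("10.0.0.0/16")
subnets = vnet.subnets(new_prefix=24)

plan = {tier: next(subnets) for tier in ("web", "logic", "data")}
for tier, net in plan.items():
    print(tier, net)   # web 10.0.0.0/24, logic 10.0.1.0/24, data 10.0.2.0/24
```

a /16 leaves 253 more /24 subnets unallocated, so a new tier later is a new subnet, not a re-addressing project.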

so how do we scale it up from there, what's the next layer as we start to grow and scale our application? well, now we're breaking out a little. so we've got the front end like we had before, but now you notice there's a second load balancer, we're using an internal load balancer. so now we've got separation between two tiers. we've got a data tier, sql servers, and those are sitting in a different subnet, so i can protect those.

this isn't a network security talk. but if you talk to some of the security folks, or come find me afterwards, i can talk to you about the importance of separating those pieces and not putting everything in the same tier. and you have a load balancer between the two, so you can balance your traffic as it goes between them. we've also added an active directory, so you can think about what you're doing in terms of authentication and

management of users within this environment. and so we're starting to scale up. this probably looks a lot like some of the things that you may be used to today. you'll also notice, there's no one single server for anything. no single points of failure. even with active directory, we've got two of them in a cluster together. and then finally, let's think about how we scale up from there.

because our single point of failure in all that, which is kind of a hidden detail, was the region that we were in. that was all in one region. so what happens if we wanna look at moving this to a second region? not moving so much as expanding it. and so here, you'll notice, here again, why the dns is important. the dns at the top is then spreading you to load balancers, and you're pointing it to two separate regions. and in this case, the one we're looking at

is kind of an active/passive design. what's the dead giveaway for an active/passive versus an active/active design? we'll talk more about these later. one, we're doing asynchronous replication, which means the two sql databases are not in sync. well, they are up at the same time, but the data is not synchronized. and you also notice that in the second region, we have far fewer machines.

it's not the same design in both places, you have what is kind of like a pilot light. you have another region that's set up and, if you need to, you can scale it up when you need it, but you are not running a bunch of infrastructure for no reason in that second one. that's an active/passive design. we'll talk about both active/passive and active/active a little bit later, but that's the dead giveaway in this one. so, i talked about vm scale sets and

i wanna drill in a little bit more on that. this is your way to do auto-scaling in azure. so, we talked about the cattle, you have the same set of vms, so it's by tier. the same way that you did your availability set, the same kind of rules apply. so you're saying, hey, this is my web tier, this is my logic tier, whatever it is, it's the machines that are all the same, and you'll define that.

you'll define the operating system for it, whether windows or linux, that you're gonna be using for that. you'll define the rules around how you want that scaling to happen. you can do this through an arm template, you can also do it through the portal now, there's a portal experience. and you can set things like, what's the maximum number of machines i wanna spin up. this is really useful for protecting yourself,

because the way it scales is you set it against some sort of metric. so let's use cpu utilization, that's a good metric, and you say, hey, if i'm at 75% cpu utilization or more, i probably want to scale up, so start up another machine. so go spin another one off. and you can say the inverse as well. if my machines are running at less than 10% cpu utilization, these are configurable of course, then i wanna scale down, take one out of the set.

so you can scale. but it also gives you a maximum limit. so let's say you had a bug that pegged your cpu. so it spins up another machine, hits the bug, pegs the cpu. another machine. you don't wanna wake up in the morning and find out there's 1,000 machines that got spun up, simply because it just kept scaling and scaling, and scaling. so we give you the ability to put a cap on it, and you can set up monitoring as well, so you can be alerted if something's happening.
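the scale-out and scale-in rules just described can be sketched as a small decision function; this is a hedged illustration of the logic, not the actual autoscale engine, and the thresholds are just the configurable values from the talk.

```python
def scale_decision(cpu_percent, current_count,
                   min_count=2, max_count=10,
                   scale_up_at=75, scale_down_at=10):
    """decide a new instance count: add a machine above the high
    threshold, remove one below the low threshold, and never pass
    the cap, so a cpu-pegging bug can't scale you up forever."""
    if cpu_percent >= scale_up_at and current_count < max_count:
        return current_count + 1
    if cpu_percent <= scale_down_at and current_count > min_count:
        return current_count - 1
    return current_count

print(scale_decision(80, 4))   # busy: scale out to 5
print(scale_decision(80, 10))  # busy but capped: stays at 10
print(scale_decision(5, 4))    # idle: scale in to 3
```

the `max_count` check is the cap being described: without it, the runaway-bug scenario keeps adding machines forever.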

but we give you the ability to kind of protect yourself from something like that happening. but set these up, because if you set up the vm scale set, again, that means you're scaling automatically, rather than you having to go in and say, hey, i'm noticing i'm getting a lot of cpu utilization, we're getting a lot of traffic, i wanna go start up a machine. that's manual. you have to do that.

we talked about automation. this does the automation for you. you set the rules for it and it will do it automatically. so, a couple things you need to know about vm scale sets. vm scale sets, like everything, have limits. everything has limits, the only question is what are they? in this case, 100 machines is currently the limit for a vm scale set. so, if you're running an application, like a very large web application for instance, you may use more than 100.

that's fine. you just set up multiple vm scale sets. that totally works. so set up multiple vm scale sets and, again, you don't want a single point of failure. so even if you're not at the limit, you may say, okay, i only have 50 machines. great. set up two vm scale sets, each targeting 25 machines.

you still have the 50 machines, but now you're protected. if one of those fails for some reason, something happened with your vm scale set, you've got a backup for it. it's all about the redundancy, and about building that into the systems. so in this case you can see i've got one vm scale set. again, put it behind a load balancer. the vm scale set is the load balancer's back-end pool.

you can point it at that. in this case i use three machines. you can do two, you can do three. but in this case, we're all in one region and i now have redundancy both at the vm scale set level as well as at the virtual machine level. so, let's take it to the next level. okay, that was great for one region. what about multi-region?

the answer is the same. you're gonna do the same thing by region, but you're gonna set it up in each region, and then you're gonna use traffic manager. and traffic manager is gonna point to the load balancer in each of those regions. so now you're able to spread your traffic across those two regions, and have redundancy both at the region layer, and within each region. so more redundancy, slightly more complexity, but

it allows your application to handle bigger and bigger faults. so the next one is network. let's talk about the network. so, load balancing. everyone is using a load balancer, right? yes? that's all right, i love these easy ways to find out. like, i always use a load balancer. so it's important to understand the difference between a layer 4 and

a layer 7 load balancer. in azure we offer you both, so if you use the azure load balancer, either the internal load balancer or the external-facing software load balancer, you're using a layer 4 load balancer. now, what that means is that it's working at the packet level. so it's taking the packets and it's just routing them as fast as can be, it moves them around. it's very scalable, very efficient, and

it will take whatever traffic you put into it. it's not limited to http. many people are familiar with layer 7 load balancers, which in most cases deal with things like http, which is great, but sometimes your traffic isn't http. you might be dealing with a different protocol. so, a layer 4 load balancer allows you to load balance all of your traffic, but there's things it doesn't do. a lot of times people want a layer 7 load balancer in addition.

so if you want an http load balancer, something like application gateway, it'll do things like ssl offloading. now remember, ssl offloading like that is only there for http, so it's not there for all your other kinds of traffic. and so, you need to understand where you can do that ssl offloading. application gateway is the way to do that, and that's a layer 7 load balancer. now, you also may choose third parties. i've listed a number of them here,

nginx and haproxy are some of the open-source ones. you've also got proprietary ones, f5, barracuda, they're available in the azure marketplace, and there's great reasons why you might wanna choose these. but if you choose to use them, you're gonna need the architecture that i've shown up on the slide here. this is actually what application gateway does under the covers. we do it all for you, so you don't see it, but this is what we're doing. and if you're choosing to use your own load balancer, any of the third party ones, you need to use this architecture.

because remember, if you have a load balancer and you've spun up a load balancer, basically it's just another machine that's running code on it. understand that, once again, you have a single point of failure if you're running one of them. so what you do is you run the software load balancer in azure, and that's just gonna do the packets, and then you create yourself an availability set. and in that availability set you'll put multiple copies of whatever load

balancer you're choosing to use, your layer 7 load balancers are in there. and when you do that, again, you've avoided that single point of failure, because if one of those were to fail for some reason, you always have the other ones to fail over to. always avoid that single point of failure. if you're using something like app gateway, you don't have to worry about that. we handle that for you, we do the scaling. same thing with the layer 4, we take care of that for you.

but if you build your own or you're using other ones, make sure that you use this design, because it will help you avoid that point of failure. so, let's talk about the different stages of availability and the different stages of load balancing. for this, it's probably easier for me to actually do a demo. there we go. so, i'm gonna go here and we're gonna take a look at what this looks like. you're gonna see the big glory of my 1997 html skills, here.

but i built this page, and what you're gonna see is there's a number of different pages. they're all the same pages, running on different servers. what i've done is i just changed the text on those servers, so you can actually tell which is which. cuz doing a high availability demo is really difficult, cuz if everything works it just looks like the site's still up. so to help you understand what's actually happening underneath the covers, i've changed some of the text so

you actually understand what it is. so here's me smiling away and telling you that you can tell where the server is, this currently is my first server over in us west. so okay, and if i hit this server and i refresh this page, you can see that i have two servers there. what am i doing? i've got a load balancer. got two servers that are sitting behind it, and the load balancer is just round-robining between those two servers.

that's what's happening, that's the design that's going on here. so, what happens if a server fails? i get to do here what i always would wanna do in production but never would. which is, i'm gonna go in, and these happen to be running on linux, but it'd be just as true if you were running iis. and i'm gonna kill nginx, the web server that i'm using here. so i'm gonna kill the web server, which is oddly satisfying, i do have to tell you.

so, you kill the web server, it's dead. so now what happens? well, let's flip over to the page. and we'll load the page. and i can keep hitting refresh, and you'll notice i'm only gonna hit server 2. why? i just killed server 1, of course i'm only gonna hit server 2. so that happens right away.

so when i say put things in an availability set, put them behind a load balancer, this is why. you fail over, and you fail over instantly. and so, there you go, so i have the other machine up there. and it's running away, and it will serve these pages until i bring the other one up and put it back into the load balancer. remember we wanted to talk about different blast radiuses. what are we protecting ourselves against? i'm protecting myself against one machine taking a hit here.
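the demo's behavior can be sketched as a toy round-robin pool that skips backends marked down; the server names and class are made up, just to show why the failover is instant from the client's point of view.

```python
import itertools

class RoundRobinPool:
    """toy balancer: cycle through backends, skipping any
    that a health probe has marked down."""
    def __init__(self, backends):
        self.backends = backends
        self.healthy = set(backends)
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def next_backend(self):
        # one full pass is enough to find a healthy backend, if any exists
        for _ in range(len(self.backends)):
            b = next(self._cycle)
            if b in self.healthy:
                return b
        raise RuntimeError("no healthy backends")  # the region-down case

pool = RoundRobinPool(["server1", "server2"])
pool.mark_down("server1")   # kill the web server on server1
print(pool.next_backend())  # every request now lands on server2
```

marking the second backend down too reproduces the region-level outage shown next: the pool has nowhere left to send traffic.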

well, gee, what happens if i then decide to be ultra-nefarious, and go take the other one down? so now, i've killed our second web server. so, we saw we were going through a load balancer, we saw that it would only hit those two different servers, and now they've all gone down. so what i've kind of created is an example of what would happen if there was a region level service disruption. right, something that would impact all of them. and now what's gonna happen to my web application?

and of course, that's a teaser that i'll get back to. [laugh] cuz i wanna talk through some of these pieces. so, what do you do if a region fails? that's what essentially i've just kind of shown you, i created a region failure by just taking down all the resources that are in that region. so this is one you can kinda think through to yourself, what would you do?

would you use a load balancer to shift to additional resources? would you use azure dns to point at a paired region? would you use traffic manager to change the regions? if you're using azure site recovery, would you start your runbook? start a failover? and then of course, you could always just go complain to twitter if you'd like. [laugh] one of these is probably gonna be your most effective option

that you go do. and what you're gonna wanna do is just go and use traffic manager. remember all of the designs that i put up earlier, we always had those kind of arrows pointing at each other, pointing at two separate regions. that's traffic manager. it's kind of part of azure dns. separate from azure dns, but it works together with it. and you're gonna wanna use that in order to be able to initiate your

ability to move your traffic to a different region. so it works, you can conceptually think of it like a load balancer at the dns level, but it works differently. and i wanna talk to you about how that works, because it's important that you understand how they're different. don't just think of them as load balancers that work just like the software load balancers. so, i'm a visual person, i love animating slides cuz that's how i learn, so hopefully this will

help you kinda understand what traffic manager actually does. so we've got traffic manager on the left here, and we've got your virtual machines on the right. so, traffic manager basically just hollers out, it's like, hey, you there? and it does that with an http get. it just sends a get across, and you can specify in traffic manager what the address is, what you want it to do. it's usually good practice to have that point at a status page. you can point it at your regular application if you want,

but the reason why you'd wanna do a status page is, one, you don't have to exercise the whole infrastructure if you don't want to, you may be doing transactional stuff, and the status page lets you see, is it up. but also with a status page, you can do something smart with it. you can have it monitoring your entire infrastructure and change the status that it returns based upon the whole infrastructure. if you just hit the front-end page, your database could be down, and the front-end page still returns an http 200 code, everything looks good.

and it thinks that your application is up and running, even though your application isn't actually functional. if you have a status page, and you've set that page to actually check the different layers of your application, and to not return a 200 status code if those don't all work, then it can actually be smart about when it does the failover and how it does it. so that's the best practice that you should follow. you don't have to, you can just point at the front end if you want.
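that status-page logic can be sketched in a few lines; a hedged illustration (the layer names are made up): return 200 only when every layer you probe is healthy, so the monitor fails over even when only the database is broken.

```python
def status_code(checks):
    """return 200 only if every layer check passed, else 503.
    `checks` maps a layer name to the boolean result of probing it."""
    return 200 if all(checks.values()) else 503

# front end alive but database down: a plain front-end page would
# still say 200; the status page correctly reports the failure
print(status_code({"web": True, "database": True}))   # 200
print(status_code({"web": True, "database": False}))  # 503
```

anything other than a 200 makes the probe count as a failure, which is what lets traffic manager act on a back-end-only outage.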

that's what i did for this application, cuz there's no database behind it. but it's gonna respond, so the machine's like, yep, all good. here's your 200 code, ack. it's all good. so, that happens, and it gets 10 seconds to do that. so it waits 10 seconds for that response. and then it basically chills for 20 seconds. it hangs out, so you've basically got 30 second intervals.

and then it says, you know what? what happens if we kill the machine? so that's what we just did. we just took down that region, so it can't access that anymore. well, traffic manager is really good at doing one thing, and it does the same thing over and over. it's like if you've ever had one of those drinky birds that just sits there and drinks the water and goes back and forth. that's what it does, and so it only knows one thing.

the whole world's a nail, and it's a hammer. i'm gonna go request it. so it goes and it sends the request again. but now the machine's not there. that region's unavailable, or at least your resources in it aren't available. just like what we just did. so it gets no response. waits 10 seconds, doesn't get it, okay.

no response, and we wait 20 seconds. tries it again. still no response. tries it again. so, it's gonna keep trying. it always keeps trying. it's all it does. but, after four tries, it says to itself, four tries, and nothing's happened. so, remember?

it takes 10 seconds to wait for the timeout, 20 seconds in between. so you've got a 30 second cycle, times 4, that's 2 minutes. after 2 minutes of getting nothing back, it finally gets a clue. and it's like, okay, this thing is actually down. it wasn't a lost packet, it wasn't something that was just a temporary blip, this thing isn't responding anymore. so, i'm gonna go and i'm gonna update dns, and it goes and it changes your dns.

now, it changes it to what you configured it to. so you configure it to your secondary site, that's your second region. or third region or fourth region, you can choose what you want, but it goes and updates dns, so when someone makes a request now, they're gonna get that new record. there's also a thing called ttl that you need to be aware of. ttl is time to live. what that means is, that's how long

a client that has made a request to your dns should hold on to that answer, cause our servers all think in numbers, ip addresses, while we usually think in names. so if i'm resolving a name like myapp.trafficmanager.net, i don't want that latency, and i don't want that load on my dns, for it to look it up every time someone is making a request. so the ttl is how long it holds on to that.

how long should your clients hold on to that before they request it from dns again? the default with traffic manager is 300 seconds. that's 5 minutes. so, it can be up to an additional 5 minutes before the client that's requesting this actually gets the update, but the ttl is configurable. like i said, 300 is the default, but you can lower that as low as 30 seconds. so you can drop that to 30 if you want.
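putting the probe timing and the ttl together, here's the worst-case arithmetic as a sketch; the 10s timeout, 20s interval, 4 probes, and 300s ttl are the defaults described above, all passed in as parameters.

```python
def worst_case_failover_seconds(probe_timeout=10, probe_interval=20,
                                failed_probes=4, ttl=300):
    """upper bound on how long a client can keep hitting the dead
    region: traffic manager needs `failed_probes` full probe cycles
    before it updates dns, and a client that cached the record just
    before the failure also waits out its ttl on top of that."""
    detection = failed_probes * (probe_timeout + probe_interval)
    return detection + ttl

print(worst_case_failover_seconds())        # 420: 2 min detection + 5 min ttl
print(worst_case_failover_seconds(ttl=30))  # 150 with the ttl lowered to 30
```

a fresh client with no cached record only waits out the detection window, which is the gap-varies-by-client point made below.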

the benefit is these things will update faster. the downside is you're gonna have more hits that go to your dns, and there's gonna be more latency as more requests go through on that. you have to strike that balance on what you wanna do. if you're looking for a quick easy way of, is there any way i can short circuit this? there is. so, you can set up traffic manager to work different ways. you can choose performance based routing, or you can choose round

robin, or you can choose weighted if you wanna weight things between two. you can cheat a little bit, if you're actively managing it, by changing it to weighted and then just weighting the other region much higher. weight the other region 100, weight the region that's down to 0, and then it'll switch over faster. but if you're not doing that, if you're letting the system just run, say it happens at a time when you're not monitoring, it'll happen, it'll wait for the ttl to end.

now, for the person that just got the record a minute before, it's gonna take the full ttl time. for someone who's just requesting it fresh, it'll happen instantly. so, realize that there is a gap there. the gap is gonna vary depending on what the client is accessing. so it goes and updates dns. and, of course, now your clients can go through. what happens when it comes back online? so, okay, you fixed the problem, the region's back,

your machine's back online, now what happens? well, like i said, traffic manager's just that little bird that keeps drinking. and so, it goes and it's always requesting, even when the endpoint is down, it keeps requesting. so, it does a request again, and this time it gets a 200 code. it's back, and as soon as it sees that it's back, it doesn't try it multiple times, as soon as it comes back and it sees it once, it goes and it updates dns.

now, you still have the ttl you need to think about, but remember, it's four tries times 30 seconds for it to make the switch over when a region fails, but coming back, as soon as it sees it's up once, it automatically updates. so that's what happens with traffic manager, and that's why it's a little different than your standard dns. so, we've talked about the different modes. performance basically uses latency and

seeing, when someone connects, which region, which of the endpoints that you've pointed it at, has the lowest latency, and it sends them there. priority is just, you can pick the priority. in this case i picked the priority to go to us west, cuz that's where i had the multiple resources. and then weighted, just like i said, you can choose. that's really useful if you want to do things like a/b testing. you want to have two different versions. you want to send 50% of your traffic,

or 20% of your traffic to something, and see what happens. you can do that through dns. so, let me flip back to our demo station here and i'll reload this page. and you'll notice now that it's saying hello, server 1, east us. so, before we were on servers 1 and 2 in west us. now we're in east. traffic manager has failed over. everything that i basically just explained, that's what's happened on the back end while this was running.

i didn't go and change anything, but the time we were talking was actually time that it was failing over on the back end, updating dns, and the ttl was expiring, just like we talked about. so that's what it looks like. like i said, high availability, not sexy, super important. now let's talk about storage. so, azure has four types of storage accounts that you can set up, four different types of storage redundancy, and how to handle them. i'll go through them quickly because i find that not everyone understands

what the differences are. then i'll talk about what you should be using and when you should be using it. so lrs, that's locally redundant storage. basically, think of it as, within the storage rack, you've got the equivalent of three different copies stored locally within that rack. rack goes down, your storage is offline. so, no single disk failure will take you down, and

we spend a lot of time making sure that we optimize for zero data loss. so you've got multiple copies there, or the equivalent thereof, based on the technology we're using. but your scope of what can fail is basically at the rack level. then you've got zone redundant, where basically you've got three copies spread across two to three different facilities. so, you've got the same number of copies, but they're spread out a little more broadly. so you've got a little bit more protection on what

you're doing there. and obviously, each of these is basically a tradeoff of cost versus your data availability. the next level up from there is grs, or geo-redundant storage. in this case, we keep three synchronous copies, just like lrs basically, and then you've got three additional copies that are stored in a different region. so you've got three more copies that are somewhere else, and that's stored in a paired region.

we'll talk about paired regions in a minute. but basically, they're somewhere else. so if an entire region is down, even if that region burned to the ground for some reason, you've got three copies that are stored somewhere else. and then finally, you've got ra-grs, which is the same thing as grs storage, except that secondary copy that you have is actually read accessible. and the easy way to test this is just, with any

account you have with this, or just go create one, take the url for your storage account and just add dash secondary on to the end of it. so, if your account name was blob, you could do blob-secondary and you would see you'd be accessing the secondary copy of your storage there. so those are the differences between those, and they're true for tables, blobs, queues, files, everything that you do in storage. when you have a storage account you can set it up and
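the dash-secondary trick is easy to script; a minimal sketch, assuming the standard blob endpoint format, with a made-up account name matching the example above.

```python
def blob_endpoints(account_name):
    """primary and read-only secondary blob endpoints for an
    ra-grs storage account: same name with '-secondary' appended."""
    base = "https://{0}.blob.core.windows.net"
    return {
        "primary": base.format(account_name),
        "secondary": base.format(account_name + "-secondary"),
    }

# hypothetical account named 'blob', as in the example above
print(blob_endpoints("blob")["secondary"])
# https://blob-secondary.blob.core.windows.net
```

the secondary endpoint only answers if the account is ra-grs; on plain grs the secondary copies exist but aren't readable.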

choose this setting for each of them. now, the key takeaway, when you're thinking about storage, is that you can mostly ignore those middle options. now that i've told you about them, i'll tell you what you should do. use locally redundant storage, especially for your disks, your vhds. you're not gonna put your vhds in something like read-access grs, because you're not going to be syncing that data constantly, and

we can spin your machines up again anyway. your machines are local. you might use something like replication for your data disks, to have your data elsewhere, but the machine and the disk that you're running on, that's actually your c: drive, that can live in lrs. now, your data, the things that you want to be accessible, if you've ever had to do a failover, if you're ever doing a disaster recovery scenario, and you want to be able to access it, you want to access it on your terms.

and access on your terms means that you need read-access grs, because with grs the data is stored and it's there, but you can't access it. you don't get to choose when you get access to it. it's there for us to be able to restore it and make sure that we've held on to your data. we make you the promise that we do everything we can to make sure that your data is secure and protected, even if something like a datacenter were to burn to the ground, but we're the ones who are going to restore it. if you wanna own that decision,

if you wanna decide when you're going to do that failover, when you're gonna copy it, you want read access to it, and that's what ra-grs is good for. so think of using those two storage types as what you're gonna use for either your data or your disks. we'll talk about paired regions in a little bit. i've also got one more thing on the storage, which is: use the azure sdk. it has this wonderful feature in it where it will check, and if storage

is unavailable it will automatically retry against the secondary region if you're using ra-grs, which means you don't have to write a lot of complex code. you just use that api, and when you go to access storage, if it's not able to reach the primary, it will automatically retry the secondary region for you. it just makes it nice and easy. paired regions. every region in azure has a paired region. paired regions are there to provide you a number of different benefits.
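the sdk behavior described here amounts to a primary-then-secondary read pattern; this is a generic sketch of that pattern with stand-in callables, not the actual azure sdk api.

```python
def read_with_fallback(read_primary, read_secondary):
    """try the primary endpoint first; only if it's unreachable,
    retry the same read against the ra-grs secondary endpoint."""
    try:
        return read_primary()
    except ConnectionError:
        return read_secondary()

# stand-ins for the two endpoints of a hypothetical account
def primary():
    raise ConnectionError("primary region unavailable")

def secondary():
    return b"blob contents from the -secondary endpoint"

print(read_with_fallback(primary, secondary))
```

writes still have to go to the primary; the secondary of an ra-grs account is read-only, so this fallback only makes sense for reads.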

i have a list here. sorry for the small text. this list grows and grows. we grew it this past month several times, and so it unfortunately gets larger, but if you want to look at the active list, there's the url at the bottom of the slide. well, this is important because it provides you isolation. when i say isolation, each of these regions is geographically separated.

so, you have the regions, and they're at least 400 miles apart. there is an exception to that, which is japan, because there it's very difficult to actually have two datacenters that are 400 miles apart. but all the others of them? they're at least 400 miles apart, which means you're in different climate zones. that means, if you had a flood, if you had a hurricane, if you had a blizzard, you're protected from that.

you don't want a dr site where you're like, my dr site is across the street! okay, are you actually on different power there? if it flooded, are you just gonna flood both of your datacenters? so we made sure they're in different flood plains, sometimes even on different tectonic plates. so, it's about providing that isolation, and also about the replication.

i talked about ra-grs, and that we have services that automatically go and replicate some of your data, and make your services easy to duplicate across those. it's also important to understand that, if there's ever a large incident that happened in azure, there's a recovery order to things. and the first thing that we try and do is make sure that we bring up at least one region of each pair in every geography. so, if you don't choose to use a region pair in your geography, you can pick two different regions, and they both can

be the ones that come up second, which you probably don't want. you probably wanna come up as fast as possible. making sure you're using region pairs ensures that at least one set of what you've built will be part of the first things that we bring back up. the same is true with updates. just like you have fault domains and update domains, one is for the things that are terrible when they happen, the other is about when we're rolling out updates.

and we make sure that we're not doing them in both paired regions at the same time. so if for any reason there was a problem with one of the updates that was rolled out, you could be sure that one version of what you were running wasn't impacted by that update. versus if you picked two regions on your own, they could be two regions where we don't make that promise, because we don't make it across arbitrary regions. so you could pick ones where we rolled an update out to both at the same time

there, and then you can be impacted by something that happens. and the final one is just data residency. we know that many of you have compliance needs around things, and being sure that, if you wanna have your data all stored in germany, or all stored in china, or in the federal government cloud, in canada, uk, all of these places, we've got dedicated regions where we guarantee to you that your data will stay within that region.

and if you wanna be able to do multi-region deployments, you can, and still stay within that geography, and still meet those data residency requirements. so, i talked earlier about how you've also got to be aware that you can have a single point of failure with storage. everything can be a single point of failure. the goal's always, remove single points of failure. and for production, always use premium disks. dev/test, you can use what you want.

for dev/test, go ahead and use what makes sense for you. but for production machines, especially for your os drives, make sure you use premium storage, ssd storage basically. and that's all about the iops, which are much higher than on standard disks. another thing is, you can put more than one machine in a storage account, but you have to be smart about it. so one, never put more than 40 machines in a storage account. and if you're doing custom images, if you've built your own versus using the ones that we have,

don't do more than 20. the reason for the difference between the two of those is we have a caching layer. and the images that we've built sit in the caching layer, so not all that traffic actually goes and hits your storage account. but if you're bringing your own vhds, then those are gonna be custom, and so you've got a different number of how many you can have. and there's iops limits, and you need to watch your iops limit. i was dealing with one customer, and they were having problems accessing,

they had a lot of availability problems. there were failures saying kernel panic, and blue screens. we took a look, and it turned out they had 350 machines, all running out of the same storage account. well, okay, what happens? well, you've got this iops limit in there, so all those machines make all those requests, and they're not all going to get fulfilled when all those machines are running.

so, if you're trying to access something like a static asset, a picture or something, that's probably fine, it probably doesn't matter all that much. you'll get some latency on it, maybe it will come back. but if you're writing to, say, your swap file, that's gonna be a problem. you'll blue screen on that. and that's what was happening. so make sure that you understand what's happening.
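the arithmetic behind that failure is worth seeing; a sketch assuming the 2017-era targets of roughly 20,000 iops per standard storage account and about 500 iops per standard disk (both assumptions, check the current limits), which is where the 40-machine guidance comes from.

```python
ACCOUNT_IOPS_LIMIT = 20000     # assumed standard-account target (circa 2017)
IOPS_PER_STANDARD_DISK = 500   # assumed per-disk target (circa 2017)

def max_vms_per_account(disks_per_vm=1):
    """how many vms fit before their disks can saturate the account."""
    return ACCOUNT_IOPS_LIMIT // (IOPS_PER_STANDARD_DISK * disks_per_vm)

def account_overloaded(vm_count, disks_per_vm=1):
    """true when peak demand can exceed the account's iops budget."""
    return vm_count * disks_per_vm * IOPS_PER_STANDARD_DISK > ACCOUNT_IOPS_LIMIT

print(max_vms_per_account())    # 40, matching the guidance above
print(account_overloaded(350))  # True: 175,000 iops against a 20,000 budget
```

350 one-disk machines in one account can demand close to nine times the budget, which is why writes to something latency-critical like a swap file start failing.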

in general, think about spacing out your storage accounts. you can use them for servers that are in different tiers of your application, but don't put servers in the same tier into the same storage account. again, because if you do that, and for some reason something happened to that storage account, or whatever storage stamp that storage account was on, then the problem is you hit multiple machines. the idea of spreading it out on the availability set we talked about

earlier has just been negated by the fact that you have a single point of failure on the storage piece. when it goes down, all those machines go down, so spread those out. if you have a premium support contract, you can contact support and they'll help make sure that you can spread those things out. they'll take a look and make sure you're in different storage stamps, that you don't have single points of failure for that. they'll help you with that.
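the spreading rule above can be sketched as a simple placement function: given the vms in one tier and a pool of storage accounts, assign accounts round-robin so no two vms in the tier share one. the vm and account names are made up for illustration:

```python
# Sketch of "same tier, different storage accounts": round-robin placement
# so no two VMs in a tier share a storage account (and thus a storage stamp,
# assuming the accounts were confirmed to live on different stamps).

def assign_storage(vms, storage_accounts):
    """Map each VM in a tier to its own storage account."""
    if len(storage_accounts) < len(vms):
        raise ValueError("need at least one storage account per vm in the tier")
    # enumerate gives each vm the next account in order -- no sharing
    return {vm: storage_accounts[i] for i, vm in enumerate(vms)}

placement = assign_storage(["web-0", "web-1"], ["stor-a", "stor-b", "stor-c"])
print(placement)  # {'web-0': 'stor-a', 'web-1': 'stor-b'}
```

the point of the sketch is the invariant, not the mechanics: every vm in the tier ends up on a distinct account, so losing one account takes out at most one machine in the tier.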

but i also want to show you a tool, so you can kind of figure these things out yourself. you can look at that and you can know if you have these challenges or not. so, i'll do a quick demo here. >> and the next slide i bring up has the url on this, so don't worry, you'll get the url and all that. once again, my awesome html skills at work here. but this is a tool that's publicly accessible,

you'll be able to hit it. and you can just put in a storage account. so in this case, we'll do a test. this is actually, for those of you that geek out over interesting architectures, this is actually a serverless architecture. this is a web page that's served out of a storage account, and then the processing actually happens as an azure function. there's no web server that's actually involved in this.

it's just kind of a fun thing to build. but when you run this, what you're gonna see are two things. you obviously know what region you're in. but you're gonna see a datacenter id, and you're gonna see a stamp number. and what you care about is that those two numbers are not both the same for two machines. if those are both the same, within a region, then what you have

is that those two machines have a single point of failure. and if those are in the same tier, then even if you might be spread across an availability set or had two separate machines, you now have one spot where that can fail. and this tool obviously works in the public cloud, but you could also use it with the government, china, and germany clouds, it works with those too. so, you can poke around with that tool if you like, just to kind of know, of course, it's neat to poke at it. but if you're gonna do anything where you really wanna do this and go through, say, your whole infrastructure, you're probably gonna access it programmatically. so, the top is what you saw in the ui, but right there you also have access to the actual azure function that you can access directly. it's a publicly accessible endpoint that you can go hit, and basically,

if you post, and i'll show what the json call is, make a json call in, specifically with the url for your storage account, it'll return a json blob that has all those different pieces in it. so, one person actually showed me the other day, they'd built a comparison tool where they take two storage accounts in, and they tell you if you've got a single point of failure between them or not. so, you can make calls to this, and this allows you to look at things.
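a minimal sketch of that comparison tool might look like this. the endpoint url and the json field names (`datacenterId`, `stamp`) are placeholders i've assumed for illustration, since the real function's url and response shape aren't spelled out here; only the comparison logic is the point:

```python
# Hedged sketch of the two-account comparison tool described above:
# POST each storage-account URL to the stamp-lookup function, then compare
# datacenter id and stamp number. Endpoint URL and JSON field names are
# ASSUMED placeholders, not the real API.
import json
from urllib.request import Request, urlopen

LOOKUP_ENDPOINT = "https://example.azurewebsites.net/api/stamp"  # hypothetical

def lookup_stamp(storage_account_url):
    """POST the storage-account URL and return the parsed JSON blob."""
    body = json.dumps({"url": storage_account_url}).encode()
    req = Request(LOOKUP_ENDPOINT, data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

def shared_fault_domain(a, b):
    """True if two accounts sit on the same datacenter and storage stamp."""
    return (a["datacenterId"] == b["datacenterId"]
            and a["stamp"] == b["stamp"])

# usage sketch: shared_fault_domain(lookup_stamp(url_1), lookup_stamp(url_2))
```

if `shared_fault_domain` comes back true for two accounts backing the same application tier, that's the single point of failure the talk is warning about.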

and so, if you wanted to say, hey, i've got a single point of failure here, maybe i wanna spin up another storage account, make sure it's on a different storage stamp, and then put my vhds there when i'm spinning something up, to make sure i avoid it. this is a way that you could do it. i do have to call out that this isn't a tool that's officially supported by microsoft.

it's just something that i built. but it's up there running, and you could play with it if you like. i just wanted to make that available, cuz i think it's far better for you to know what's going on with your resources, and what you can do, than not know. so let's talk about common architectures. so databases, i mentioned databases before. we gotta go through the different consistency models,

how you think about them and how you design for them. so my guess is everyone here is probably familiar with rdbms sql databases. right? this is like old hat, everybody knows this, quick check that everyone's awake, yes, good. so this provides you strong consistency. this is great for things like financial transactions, that's a great example.

the quintessential example of this is, you do not want to be replicating your sql database between, let's say, shanghai, london, and seattle, where you're gonna synchronize them, and then someone figures out, hey, i can do this transaction where i make a request to, say, withdraw $5,000 or transfer it to an account. and if i submit it to all of them at once, they don't actually talk to each other. they're on a slow update between them.

they're asynchronously updating. and so they all say, that's a good transaction, go make it happen. sql databases help you protect against that, cuz you can set it up to make sure that your data is strongly consistent. you can make sure that when a transaction goes through, the other ones are queued up behind it. now the downside to that is there's latency with that.
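the double-withdrawal scenario described above can be shown with a toy model. each replica here approves the withdrawal against only its own local balance, which is exactly what happens when replicas update each other asynchronously and no coordination exists at transaction time:

```python
# Toy illustration of the double-withdrawal risk under asynchronous
# replication: each replica approves against its own local balance, so
# the same $5,000 clears in all three cities before any sync happens.

class AsyncReplica:
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        # No coordination with other replicas -- local check only.
        if self.balance >= amount:
            self.balance -= amount
            return True
        return False

replicas = [AsyncReplica(5_000) for _ in ("shanghai", "london", "seattle")]
approved = sum(r.withdraw(5_000) for r in replicas)
print(approved)  # 3 -- every replica said yes to the same withdrawal
```

with strong consistency, the first approval would be replicated and the other two requests would be queued behind it and rejected; that serialization is precisely where the latency cost comes from.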

if you're gonna copy your data all around the world, and you're gonna make sure and then verify that that data copy happened, and you're going to queue up all of the transactions behind all of those, that's going to slow down your processing. and so, although you may say, hey, you know what, for large money transfers that seems like a perfectly good trade-off, that may not seem like a perfectly good trade-off for something like, say, an ad server, where you want things to move super fast.

so, it's important to understand the benefit you get from strong consistency, that all your data will always be the same across all of them, but there's a latency hit depending on how far apart they are. so with sql server, if you're thinking about things like using always on, the thing to remember with that is the latency envelope. there's about a six millisecond latency envelope for synchronous always on with sql server. beyond that, if you're going, say, between regions or

around the world, you're gonna be in asynchronous replication. and asynchronous means they're not all the same at the same time. so, the next step on that is taking a look at something like nosql databases. these are eventually consistent. they're on the other end of the spectrum. what it says is you're gonna be super fast. things are gonna happen really quick. that's great.

there's no guarantee that what you're getting is actually the latest data, and that it's all consistent amongst them. so this might be good for things like games, leaderboards. things where people wanna see updates right away, but it doesn't really matter if there's a consistency challenge between the two. the data will reconcile itself in a few seconds, and that's totally okay for what you're doing. and then of course there's a middle ground between the two.

so documentdb is a good example, and this is about session consistency. so, this would be something like, think about if you put a post up on facebook or linkedin. you wanna see that you put that post up right away. if you put a post up and then you hit done and submit, and nothing shows up, that's gonna look weird to you. and so it makes sure that within your session, it's there, it's consistent.

but what it doesn't guarantee is that that's synced to everywhere else. that will happen later. and so your friends' feeds might not be updated for a minute or so. and if you posted a new picture, or if you've gone and done an update, that's probably okay. but these are the three different models, and you handle them a little bit differently. so it's important to understand which of the ones you're gonna

wanna choose and how you do the replication. because replication for things that are eventually consistent or session consistent is much easier. you can replicate those around the world. you don't care. it's not a financial transaction. it's not gonna impact you that much. versus things that are rdbms, like sql, you do have to make sure

that you're making those decisions around the impact there. so let's take a look in our architecture at what that looks like. this is your standard three-tier architecture, and you've got the three tiers that we talked about. you'll notice we also put a distributed cache here. this is really about performance and scalability. for distributed caching in azure, use something like redis cache. that's basically an in-memory database, so you're not hitting your sql servers,

you're freeing up those resources in order to be more scalable. and so, this looks like what you've probably been doing for years, so i won't spend a ton of time on it. i wanna talk about multi-region deployments. so, before, i talked about active/passive. so remember the passive ones, and i'll show you an example in a second. we'll bring it back up. but basically, you've got one version that's sitting in one

region, and that's the one that you're using actively. that's the one that's at scale, that you're running, that all your users are being trafficked to. but then you've got traffic manager, so it's aware that there is a second one out there. you're not using it, but it's just got a few resources that are up and running that sit there just in case you need to fail over to it. so it's really one region with some failover, and you're doing that by running things that are reserved, by reserving your capacity. now, active/active gives you a much stronger environment.

when you fail over, remember, you're going to scale up the second environment. active/active is already scaled up. you're already running actively in both environments. so it's a little bit more expensive to run that, but the biggest challenge is to run your data layer. because again, if you're running active/active, you need to think about the concurrency model for your database, which we talked about on the last slide.

and the more you're using things like sql databases, the more you need to think about what's the latency impact to your users. you don't need to worry about that when you're doing active/passive; you do need to worry about that if you're gonna try an active/active design. and then active/performance is kind of a hybrid of the two. it's basically using traffic manager, where you're taking a look at the latency between the two, and you're basically running active/active, and then routing users to either location rather than just one location, based upon where

they're located, to make sure that they get better performance. so, oltp is online transaction processing. here's an example of one of the setups that's been set up. and you'll notice the giveaway i talked about before, an async synchronization between the two regions, between the sql databases. so, that means you're talking about active/passive. you've got one version that's always up, that's running. the other one's there, available to start up if you need to fail over quickly.
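the routing behaviours behind these designs can be sketched as two tiny functions: priority routing (active/passive: everyone goes to the first healthy region) and performance routing (the hybrid: each user goes to their lowest-latency region). the region names and latency numbers are made up for illustration:

```python
# Toy model of the two traffic manager behaviours discussed above.
# Priority routing backs an active/passive design; performance routing
# backs the active/active "hybrid". Regions and latencies are invented.

def route_priority(regions):
    """regions: list of (name, healthy) in priority order; first healthy wins."""
    for name, healthy in regions:
        if healthy:
            return name
    raise RuntimeError("no healthy region to fail over to")

def route_performance(latencies):
    """latencies: dict of region -> measured latency (ms) for this user."""
    return min(latencies, key=latencies.get)

print(route_priority([("west-us", False), ("east-us", True)]))  # east-us
print(route_performance({"west-us": 120, "east-us": 35}))       # east-us
```

the data-layer caveat from the talk applies to the second function: once different users land in different regions at once, you're active/active, and the database consistency model has to support that.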

if you're doing active/active, you'll notice that everything basically looks the same, but we've changed the databases. we've now moved to one of those session consistent databases, documentdb, to make sure we have the session consistency, so we can sync that over and we can run active/active. a different database model, to be able to have that active/active design. so, a quick example where we've done this. i worked with the solution architects on the 2016

democratic national convention. that website was actually set up on azure. set up to be highly available, set up to be cross-region. really high scale, as you can imagine, with politics this time of year. there's a huge amount of load on that, and six months from now, the website will probably be gone. so, it's really great for cloud in terms of the ability to scale. you can see at the bottom, i've got a chart here that shows you the real traffic that we saw on the website, and we were hit with

a ddos attack on it on tuesday, and we handled that ddos attack, because we designed something that was highly scalable, that was spread across regions, that could handle someone sending a tremendous amount of traffic and data without worry. so here's what that architecture looks like. i apologize for the eye chart here, but this is what real architectures look like when you build them out. i'm gonna ignore the security stuff, simply because we don't have time to go through that, but we're using azure cdn at the top of this. using cdn for ddos protection, to be able to scale those pieces out.

this happened to be built on linux. we're using an ubuntu web server. we had a number of those. again, an availability set sitting behind a load balancer. hopefully, you're noticing the pattern in these. and then mariadb. mariadb, for those who aren't familiar, is a mysql database built for scalability. so we used that as the database that sat behind this. and then, like i said, there are some jump boxes and ad and

key vault, and that's really security stuff. if you're curious, i would love to talk to you about it afterwards, but it's not tied to the high availability stuff that we want to chat about today. and we duplicated it across regions, so you could have this available in multiple regions. so if something impacted one region, we're available in another. when the event was over, one of the regions was shut down. there's no reason to run multi-region in this

particular instance. it was about making sure we had the availability for a very high scale event to happen. so, i wanna talk a little bit about cdns here. we mentioned them earlier, about being sure that you put stuff as close to people as possible. cdns are great protection from ddos, but i wanna talk to you about the availability piece. like i said, there is security stuff in there as well, about obfuscating your origin.

but really, this is about offloading a lot of the resources that get soaked up from your machines. so when people are hitting your machine, the machine has to process that stuff. when you move things to a cdn, it takes a lot of that processing and moves it off of your machine. that means that your machines are more scalable without you having to actually buy bigger machines. it also means that it's available everywhere, because cdns are distributed everywhere.

so, azure cdn gives you two options. you can use verizon or you can use akamai, and they have literally thousands of spots all around the world. and when people make a connection, they're gonna make a connection to the cdn first, get all the data they can from the cdn, and only then will they hit your origin, your server. so, it takes a ton of load off of your server. so, i want to give you a quick demonstration of what that looks

like and why it matters. so, i'm gonna go and open an incognito window here. the reason i'm doing incognito is because i've hit these pages before, and otherwise it would just play out of the cache and it wouldn't show you anything, although it would be remarkably fast. so, what you're looking at here is a web page that has three iframes in it, and the three iframes are all served off the same server. everything here is served off the server that i've got in west us.

it's serving these pages up. and basically, if i click on each of these, it's just gonna load a page from it. so it's got a timer that just times how long it takes and tells you that. it's a set of various monkeys. if anyone recognizes naruto in the picture, that's awesome. he's the monkey that took the selfie, the one in the top right, when some guy left the camera in the forest and

the monkey actually took a picture of itself. so that's him and a bunch of his friends. and so that took a little over 10 seconds to load, and that's just a timer that's running. like i said, that's the server there, i put the pictures on it. this is all live, you can hit it on your machine if you want. the only thing to be aware of is i chose six pictures, because that's the limit of how many concurrent connections your web browser will allow.

so i didn't wanna be impacted by the browser saying, okay, i'm gonna delay a certain number of them. and the pictures, although they show up small in this example, are actually eight megapixel pictures, and i haven't done anything like gzipping them. the reason i did that, and put it on a small instance type, is because i wanted to show you the impact of load on a server from one machine, rather than having to spin up a whole bunch of machines and have them all hit it to show what load meant.

so i've just scaled it down. you could scale it up and you'd have the same thing, you'd just have to hit it with more people. so when i hit this one, it's pulling it out of a storage blob. this is the same thing. same web page coming off the same server. the only difference is i'm taking all the pictures, and rather than serving those pictures off the machine, where i'm having to run the machine to process them, i've moved them out to blob storage.

so a great scalability choice is to take your static assets, anything you know that's static, put them in blob storage, and then just point to the blob storage where you pull them out. so, you can see it's less than half the time. it's the same page. i just moved the assets out to storage. it will not only be cheaper and help you scale better, but also, did you notice the time difference?

there is one trick here that i wanna point out, which is the web server here is in west us, but the pictures that you see loading for that second one are actually in west india. so even coming halfway around the world, it's still half the time for your user experience on these. but of course, we can do one better, and i'll show you the cdn, and that's a real live load there. so, that's the same thing. [laugh] thank you.

what i did basically is i just took azure cdn and pointed it at that same server. and so what's happening is, rather than having to go to west us and pull it, it's actually contacting azure cdn and pulling those static assets down. so, you can have a load time that is literally orders of magnitude different for your customers. like i said, this is real code. if you want, take this url, hit it on your machine, and you'll see the same thing.
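the whole demo boils down to one change: where the asset urls point. a minimal sketch of that offloading step, with hypothetical account and endpoint names (the real demo's urls aren't given here), might look like:

```python
# Sketch of the offloading step from the demo: point static-asset
# references at blob storage, or at a CDN endpoint layered in front of it,
# instead of at the web server. Both base URLs below are ASSUMED names.

BLOB_BASE = "https://mystorageacct.blob.core.windows.net/assets"  # hypothetical
CDN_BASE = "https://myapp.azureedge.net/assets"                   # hypothetical

def asset_url(filename, use_cdn=True):
    """Build the URL an <img> tag should use for a static asset."""
    base = CDN_BASE if use_cdn else BLOB_BASE
    return f"{base}/{filename}"

print(asset_url("monkey-selfie.jpg"))                 # served from the CDN edge
print(asset_url("monkey-selfie.jpg", use_cdn=False))  # served from blob storage
```

the server never touches those requests either way; the cdn variant just adds the edge caching that produced the orders-of-magnitude load-time gap in the demo.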

just don't do it on your conference wi-fi, because what you will really be testing is how overloaded the conference wi-fi is. you'll see huge numbers, and that has to do with the wi-fi, not the cdn. but if you get on a hard line, try it, you'll see it. so, those are the kinds of availability gains you can see using a cdn. not only do you get the security benefits of it, but you can also create a lot more availability and take a lot of load off of your servers.

so, that was the demo. the last thing that i want to talk about today is fault injection. so remember, when i said trust but verify at the beginning? how do you verify it? the answer is you test, always test. it's one of those habits that gets drilled into you, and i'll admit that my early days in engineering actually started as a tester, so this is close to my heart. there's a great tool out there called chaos monkey. it was created by netflix.

they designed it to work with aws, but you can convert it over. you can use it with azure. there are all sorts of fault injection tools out there as well. i don't have any particular tool that i recommend over another. you pick whichever one is right for you. what's really more important is that you're doing the testing, not which tool you're choosing to do it with. if you actually wanna kind of try out what it's like to be your own chaos monkey, we built something that's down on the show floor.

if you come by our booth, you can play a game that we built. in the game, you go and shut off servers that are flying around the screen. and the little trick to it is, all the servers that you shut down in the game, you're actually shutting down the actual servers that are running the game. so, you can go and try to create downtime by trying to take it down within the game. we did that to showcase some of the availability stuff, and i'd love to

show it to you all, but you get to play the chaos monkey there. it really doesn't matter what you're doing, just go in and test. these are the things that help you find the single points of failure. remember i said, if you take nothing else away, take away that single points of failure are bad. you need to get rid of single points of failure, and this is a great way to find them. go in and do fault injection. go in and just shut things down, see what happens.
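at its core, that kind of fault injection is a small loop: pick a random instance, kill it, and let your health checks tell you whether the app survived. here is a bare-bones sketch; `stop_instance` is a stub i've invented for illustration, standing in for whatever stop/deallocate call your cloud provider exposes:

```python
# Bare-bones chaos-round sketch: kill one randomly chosen instance per
# round, then observe whether the application stays healthy.
# stop_instance is a STUB (assumption) -- in practice it would call your
# provider's stop/deallocate API for the chosen VM.
import random

def stop_instance(instance_id):
    """Stub for a real shutdown call against the infrastructure."""
    print(f"stopping {instance_id}")

def chaos_round(instances, rng=random):
    """Kill exactly one randomly chosen instance -- never the whole fleet."""
    victim = rng.choice(instances)
    stop_instance(victim)
    return victim

victim = chaos_round(["web-0", "web-1", "web-2"])
```

the discipline matters more than the tooling: run rounds like this on the environment you built from your automation scripts, and every round either passes your health checks or hands you a single point of failure to fix.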

if you're afraid to do that in production, realize what you're telling yourself about how brittle you feel your infrastructure is. and that's okay, you might not be there yet. set up a test environment and go and do it, but understand what that means. when you're setting up that test environment, you should be using the automation and scripts we talked about earlier, because then you can be sure that you have the exact same thing that you had previously. so use that automation to set it up.

go ahead and hack at it. find where these things are, because that's the way that you're gonna make sure that you set these things up right, and you get better at it. you make sure that you know that you're indeed fault-resilient. you can say that you are, but until you've actually proved it, you don't actually know that you are. so, let's wrap it up. we talked about designing for high availability. we talked about understanding best practices and

common pitfalls that are in it. and i talked to you a little bit about some of the inner workings of how we built some of the azure stuff itself. if you want more information, here's a bunch of articles we put up, and there's more available. it's all available in the slide deck, by the way. and the slide deck will be available, so you can download it. and these pieces are up on azure.com.

if you download the slide deck, as a little added bonus, at the end of the slides i actually added a slide with a high availability checklist. so, 15 things to go and look at in your infrastructure, to see whether you have them or you don't, so you know if you're missing best practices. it's a quick, easy test that you can go and do. also, related to other sessions, i mentioned before some of you are interested in disaster recovery, or other parts.

some of these have already happened, so you can look up the recordings online, or in the app. some of them are yet to come, the ones closer to the bottom. so if you're interested in some of the disaster recovery pieces, remember i said this is about high availability; resiliency goes to the other end with backup and disaster recovery. there are some sessions you can go to to learn some great things about those pieces. and please, when you have time, fill out your evaluation.

it's how i know what you wanna hear from us, what we can do to be better. microsoft takes the time to fly me out here so that i can talk to all of you, and it really is important. we do it to make sure it's better for you. it's not so i can be on a stage. it's so that we can help you learn what you can do to be successful in your applications and what you do in your enterprise. please take the time to fill it out.

it helps me get better. it helps me provide better information to the people who come to ignite next year. we appreciate that. there are also some additional resources available, just in general, for it pros. and for those of you who have to run to your next session, i will say thank you very much for coming. it was an honor to present to you and i appreciate your time.

>> [applause] >> thank you. [applause] and with that, i've got a few minutes for q&a. so, if you want to come to any of the microphones, i'd be glad to answer those. otherwise, if you can't make it, i'll be down at the booth and happy to answer questions there as well. >> hey there. >> hi. >> in all of your great diagrams, there were still some points where there was a single image, and that was of course the azure load balancer, which is highly available.

>> mm-hm. >> do you have any insight into how they architected that to be highly available? did they use round robin, or some of these other techniques, or is that secret sauce that we can't talk about? >> so, i can't talk about what the actual underlying architecture of that is. what i can tell you is that we've designed it to be highly available, so there's no single point of failure within the load balancer service, to make sure that you're not hit with that.

>> so if we're using two load balancers on two separate layers of our application, we don't have to worry about those both going down at the same time or anything like that. region wide, or-- >> there's always, if storage went down, or if the region was hit by a meteor, like, i can never say never. >> [laughs] >> but i can say that it's designed for high availability within that region. and then if you need to be cross-region, then you would add something

like traffic manager to load balance it to a different region. thank you very much. sorry. >> appreciate it. so, for availability sets. >> yes. >> are they specific to premium storage only, or are there availability sets for other types of storage besides premium? >> so, availability sets are available regardless of the type of storage that you use.

that said, premium storage is recommended for your production workloads, for running your system drives on. you'll just get better performance, better io, more consistent io. but availability sets are available regardless of what kind of storage you're using. so even if you're doing your dev test workloads on standard storage, you should still use an availability set, yep. >> should you use, so say for example vms, right, if you have an availability set, can you put

the same vms in the same storage account, or is it best to have the availability set spread between two storage accounts? >> so, within the availability set, you should always make sure that your vms are in different storage accounts. because those are the same tier of the application, and if you put them in the same storage account, you're guaranteeing that you have a single point of failure. so within the availability set, which by definition is a tier of your application,

you absolutely want them indifferent storage accounts. >> okay. cool. thank you. >> thank you very much. >> [inaudible]
