DELL PowerEdge R730 Improves Boot Times


The DELL generation 13 servers are blazingly fast and capable. That has been well documented by now and more and more people are experiencing it themselves. These are my current preferred servers as they offer the best value in the market for hard core, no nonsense, high performance virtualization with Hyper-V.

They also have better boot/reboot speeds with UEFI than the previous generations. We noticed this during deployment and testing. So we decided to informally check how much things have improved.

Using the DELL DRAC8 we timed the process from the Windows Server restart …

image

… over the various boot phases …

image

… to the visual appearance of the logon screen

image

So now let’s quickly compare this for a DELL PowerEdge R720 and a PowerEdge R730. Both with the same amount of memory, cards, controllers etc. None of these servers had VMs running or any other workload at the time of restart.

For the R720 this gave us:

image

And the result for a Windows initiated server restart on a DELL PowerEdge R730 with UEFI boot is:

image

This was reproducible. So we can see that the UEFI boot times have decreased by about 30%. I like that. You might think this is not important but it adds up during troubleshooting or when doing Cluster Aware Updating on a large 16+ node cluster.

Now things are beginning to look even better as vNext of Windows has a feature called “Soft Restart” which should help us cut down on boot times even more when possible. But that’s for another blog post.
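To get a feel for how this adds up, here is a minimal back-of-the-envelope sketch. The 30% reduction and the 16 nodes come from the text above. The 8 minute baseline reboot time is purely a hypothetical figure for illustration (the measured times are only in the screenshots), and the function name is made up for this example. It assumes the nodes are rebooted one at a time, as in a rolling update.

```python
def reboot_time_saved(nodes, baseline_boot_s, reduction=0.30, reboots_per_node=1):
    """Seconds saved over one rolling update run where nodes reboot one at a time.

    baseline_boot_s is the reboot time on the older server generation (assumed),
    reduction is the fraction by which the newer generation boots faster.
    """
    return nodes * reboots_per_node * baseline_boot_s * reduction


# Hypothetical input: 16 nodes, 8 minutes per reboot on the older generation.
saved = reboot_time_saved(nodes=16, baseline_boot_s=8 * 60)
print(f"{saved / 60:.1f} minutes saved per update run")
```

Plug in your own measured reboot time; the point is only that the saving scales linearly with the node count and the number of reboots.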

SMB Direct With RoCE in a Mixed Switches Environment


I’ve been setting up a number of Hyper-V clusters with Mellanox ConnectX-3 Pro dual port 10Gbps Ethernet cards. These Mellanox cards provide a nice amount of queues (128) for DVMQ and also give us RDMA/SMB Direct capabilities for CSV & live migration traffic.

Mixed Switches Environments

Now RoCE and DCB are a learning curve for all of us and not for the faint of heart. DCB configuration is non trivial, certainly not across multiple hops and different switches. Some say it’s to be avoided or can’t be done.

You can only get away with a single pair of (uniform) switches in smaller deployments. On top of that I’m seeing more and more different types of switches being used to optimize value, so it’s not just a lab exercise to do this. Combine this with the fact that DCB is an unavoidable technology in networking, unless it gets replaced with something better and easier, and you might as well try and learn. So I did.

Well right now I’m successfully seeing RoCE traffic going across cluster nodes spread over different racks in different rows at excellent speeds. The core switches are DELL Force10 S4810s and the rack switches are PowerConnect 8132Fs. By borrowing an approach from spine/leaf designs this setup delivers bandwidth where they need it at a price point they can afford. They don’t need more expensive switches for the rack or the core as these do support DCB and give the port count needed at the best price point. This isn’t supposed to be the pinnacle of non blocking network design. But what’s available & affordable today in your hands is better than perfection tomorrow. On top of that this is a functional learning experience for all involved.

We see some pause frames being sent once in a while and this doesn’t impact speed very much. It does guarantee lossless traffic, which is what we need for RoCE. When we live migrate 300GB worth of memory across the nodes in the different racks we get great results. It varies a bit depending on the load the switches & switch ports are under but that’s to be expected.
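As a sanity check on what “great results” can mean, here is a small sketch of the theoretical lower bound for moving that much memory. The 300GB and the dual port 10Gbps cards come from the text. The function name and the `efficiency` parameter are assumptions for this illustration, and real live migrations will take longer due to protocol overhead, memory being dirtied during the copy and switch load.

```python
def min_transfer_seconds(data_gb, link_gbps, links=1, efficiency=1.0):
    """Lower bound in seconds to move data_gb gigabytes over `links` links.

    Uses decimal units (1 GB = 8 Gb). efficiency is the usable fraction of
    the line rate, 1.0 being the unreachable ideal.
    """
    if link_gbps <= 0 or links < 1 or not 0 < efficiency <= 1:
        raise ValueError("need link_gbps > 0, links >= 1 and 0 < efficiency <= 1")
    gigabits = data_gb * 8
    return gigabits / (link_gbps * links * efficiency)


# 300GB of VM memory over one and over both ports of a dual port 10Gbps card.
for ports in (1, 2):
    seconds = min_transfer_seconds(300, 10, links=ports)
    print(f"{ports} x 10Gbps: at least {seconds:.0f} s")
```

Anything close to these numbers in practice means the links are being saturated, which is why RDMA does not make the migration itself faster but does free up CPU while doing it.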

Now tests have shown us that we can live migrate just as fast with non RDMA 10Gbps as we can with RDMA, leveraging “only” Multichannel. So why even bother? The name of the game is low latency and preserving CPU cycles for SQL Server or storage traffic over SMB3. Why? We can just buy more CPUs/cores. Great, easy & fast, right? But then SQL Server licensing comes into play and it becomes very expensive. Also, storage scenarios under heavy load are not where you want to drop packets.

Will this matter in your environment? Great question! It depends on your environment. Sometimes RDMA is needed/warranted, sometimes it isn’t. But the Mellanox cards are price competitive and why not test and learn right? That’s time well spent and prepares you for the future.

But what if it goes wrong … ah well, if the nodes fail to connect over RDMA you still have Multichannel and if the DCB stuff turns out not to be what you need or can handle, turn it off and you’ll be good.

RoCE stuff to test: Routing

Some claim it can’t be done reliably. But hey, they said that about non uniform switch environments too. So will it all fall apart and will we need to standardize on iWarp in the future? Maybe, but isn’t DCB the technology used for lossless, high performance environments (FCoE but also iSCSI)? So why would iWarp not need it? Sure, it works quite well without it. So does iSCSI, right, up to a point? I see these comments a lot more from virtualization admins who have a hard time doing DCB (I’m one so I do sympathize) than from hard core network engineers. As I have RoCE cards and they have become routable now with the latest firmware and drivers, I’d love to try and see if I can make RoCE v2, or Routable RoCE, work over different types of switches. But unless someone is going to sponsor the hardware I can’t even start doing that. Anyway, lossless is the name of the game whether it’s iWarp or RoCE. Who knows what we’ll be doing in 5 years? 100Gbps iWarp & iSCSI both covered by DCB vNext while FC, FCoE, Infiniband & RoCE have fallen into oblivion? We’ll see.

Dilbert Life Series: Mediocrity Kills aka Show Me Your Strategy Or Be Doomed


Disclaimer: The Dilbert® Life series is a string of posts on corporate culture from hell and dysfunctional organizations running wild. This can be quite shocking and sobering. A sense of humor will help when reading this. If you need to live in a sugar coated world where all is well and bliss and think all you do is close to godliness, stop reading right now and forget about the blog entries. It’s going to be dark. Pitch black at times actually, with a twist of humor, if you can laugh at yourself.

“Some men are born mediocre, some men achieve mediocrity, and some men have mediocrity thrust upon them.”
― Joseph Heller, Catch-22

I don’t do mediocre. There, I said it. I only do good to great. Well, sort of. The point is that no matter how good you are, you still mess up. While perfection is not of this world it doesn’t look too great on my résumé when I have to write “As a real team player I collaborated enthusiastically to achieve mediocrity”. Sure I might cover it up with fluff like “I integrated the lateral dynamics of horizontally deployed technologies across a vertically integrated stack to realize an optimal use of resources exposing their inherent value to the business while leveraging the synergies of the cloud”, but I won’t.

As no one likes to be mediocre we sometimes see creative attempts to make sure we all pass the bar but we won’t discuss that here. Whilst every organization will have its share of mediocre processes, way too many are mediocre as an entire organization.

Indicators of mediocrity

Claiming to be innovative

Avoiding mediocrity is not about being original or “innovative” all of the time. Quite the opposite! Sometimes not being mediocre means using plain good commodity solutions that are great for the issue at hand. The good old 80/20 rule, “good enough is good enough” & commoditization deliver the best value for money here. Don’t spend vast amounts of money and time on custom or “boutique” solutions when a commodity will do. This has secondary benefits as well. That time and money can be used for some custom or creative design & work on the things that do matter a lot and make a big difference.

Groups providing false security

For some reason mediocrity tends to flourish more often in groups and committees. I see this way too much. The danger of sliding into mediocrity exists for an individual too, but it seems to become more prevalent in a group or organization. Some of my peers call this “the race to the bottom”:

“Mediocre people working for mediocre organizations delivering mediocre results”

Nobody wants to be that way, it just turns out like that. There are many reasons: The Peter Principle, The Dilbert Principle, B people hiring B people, human behavior in an environment where it’s wiser to conform & play politics than to get results, etc. Don’t underestimate the group pressure to conform, avoid mistakes, be a team player or a “can do” person. And then there is the desire to avoid responsibility, which also happens to be easier in a group. The bigger the group in a meeting, the bigger the risk of this. A group enforces indecisiveness & caters to fears.

Some organizations tolerate and even reward mediocrity. Management leads by example, whether they like it or not. The effects of this can be partially hidden and mitigated by real leadership in the group (competent employees, highly skilled external help), but it cannot be stopped. If management doesn’t care, they can’t expect others to care. If managers talk about team work & going the extra mile but don’t do so themselves, things break. If the need for safety, fear of failure or of not looking good is what drives them, you won’t progress & see success. Success cannot be bought and you can’t lead from behind.

Mediocre groups can be manipulated quite easily. “Politicians” like this. It’s like water following the path of least resistance. By leveraging the group you make them accomplices and they can’t complain about decisions made over their heads. Some (most) probably know all too well that they are being manipulated, but why struggle if there is no benefit in it? It’s safer to conform and when risk aversion sets in, great ideas die. Here’s a beautiful summary (thanks to Kathy Sierra):

Riskaversion2

Avoiding reality is a game we all play to some extent. Think of the abuse of best practices, methodologies and such, by clinging to them like a life raft or actually thinking that following the bullet points will magically deliver stellar results. This leads to needing ever more resources for ever diminishing returns on investment. The organization becomes an overly complex entity where avoiding responsibility is a top priority and perception is everything. ITIL done wrong will achieve exactly that. It drains all the fun out of work and grinds progress to a halt. But no one is to blame as all rules were adhered to. Risk Avoidance As a Service (RAAS™).

Personal note: The power of a group lies in the excellence of the individuals and their ideas. Harvesting those to create the best possible solution is far from conformity to different points of view. It’s about leveraging the discussions, the different or opposite points of view to come to better solutions. In this respect I find the view that “people should learn to do what they’re told” misguided, dangerous & counter productive.

Who’s managing and who’s leading, if anyone?

It doesn’t take very long to walk into a group and observe who the real leaders are. Often these are not the people with the rank, title, mandate. In a lot of cases they are very different persons. This might sound great as a fail safe, but there’s only so many wrongs bottom up approaches can prevent or mitigate, let alone solve. “Bottom up” can only do so much.

This isn’t surprising as middle management is used as a dumping ground for people who can be missed in critical functions and who are willing to sell their souls for the illusion of advancement. They often become a burden to employees & progress.

Now employees do notice this and it ruins trust. Sure you can blame the culture and bad attitude but hey, when the team or the organization fails it is the managers’ fault and their responsibility. No, this is not too harsh. They are all too eager to claim higher wages & ownership of success. Well, that knife has two edges and you can’t blame it on the culture. You get the culture you cultivate. Those who can’t handle that responsibility are the ones to fail as managers & most certainly as leaders. You cannot complain to your subordinates as a manager. Shit flows down, gripes flow up. Got it?

Read The Dilbert Life Series – A Bad Manager’s Priorities. Your personnel already has enough crap to deal with, just like you. Don’t add to it. Not that employees can’t be total fools and pains in the proverbial behind but hey, I have posts on that too.

Strategies, Tactics & Execution

Mediocrity is seen where real strategies, tactics & execution are missing. Such organizations just do or buy stuff, often without any understanding of the ecosystems they operate in and the relations between them. Their situational awareness is zero and that’s deadly. So we have “managers”, “architects”, “analysts”, both in house and consultants, that cannot even explain what a strategy is. They might claim or believe to have one, but they don’t. It’s opportunistic actions towards the flavor of the day. Such an organization is doomed to mediocrity and survival is by chance, not skill.

Who’s to blame?

Most people just try to survive or perhaps get ahead to a nicer job and/or a better paid one. But no one will admit to it on a performance review, so we have institutionalized lying. At best you’ll get justifications when you ask, but no real explanations. It’s not just as simple as managers being stupid or lazy. When it comes to strategy many are playing a game they don’t understand, let alone master. They are out of their depth and as such they are bound to lose. They’re being used.

However it’s very much in vogue to blame the lack of Business – IT alignment for the woes in these volatile IT times. The problem is not IT or the business. It is the entire organization that allows for mediocrity. Sure, you read all over the internet that “IT is an old school ivory tower” and that it has to prove its value. It’s a pure failure of management, who don’t seem to know who does what and why in their organization. The division is purely artificial. It’s man made and kept alive as it serves political, personal & careerist agendas. Book authors, coaches & business consultants smile as they collect their fees discussing this at length. Welcome to mediocrity and failure. You have exactly what you have built.

consultingdemotivator[1]

Nobody has any incentive to fix it either. There is good money to be made and job security to be had by prolonging the problem on both sides. Are these people to blame if someone keeps paying them for that? These woes are true both in the private and in the public sector. Bar some minor differences in buzz words they all get handled by the same players. These are the ones that deliver the lobbyists and advisers that turn out ever fewer services for ever higher costs. They sell “solutions”. One size fits all if possible. Gartner makes a killing from this situation and they do have a clear strategy for that.

No IT strategy? No map? You’re doomed, indecisiveness will kill you.

FSCN0508_thumb[1]

If you don’t map out your game on the field you play on you can have no strategy. Without that you just do stuff. At best it’s functional (which is an achievement by the way) but often not. Planning, methods, tools … all of these fall victim to indecisiveness. So execution becomes impossible.

Here’s the result of decisiveness & purpose of action. You create green waves. When all the lights are green, you can ride the green wave. No starting and stopping, but a fluid, highly effective way of moving ahead towards your target.

image

You’re not always in that situation and the light will turn orange & red along the way. That’s life and it’s not too bad unless you get caught in deadlocked traffic jams during rush hour.

That situation requires a solution as it’s stressful, frustrating and detrimental to achieving your goals. In extreme cases the time between the colors becomes shorter and shorter and eventually drops to zero …

There is another form of deadlock. Doing everything for everyone at the same time to avoid making choices. All the lights are on, on all sides, at all times. You do not get a clear signal or guidance.

TRafficLightsElsDeventer

Indecisive action kills or grinds you to a halt. Whatever the case, you’re losing time and failing to reach your goals. Either by doing everything for everyone at the same time or by being stuck in a mess. Game over.

SMB Direct: Choosing A Flavor


I often get asked what to buy for implementing SMB Direct. It’s a non trivial question actually and I’m not an expert, nor do I play one on TV. All joking aside, it’s a classical consulting answer: it depends. I don’t do free consulting in a blog post, even if that was possible, as there are many factors such as the characteristics and futures of your organization. There’s also a lot of FUD & marketing flying around. Basically in real life you only have two vendors: Chelsio (iWarp) and Mellanox (RoCE/Infiniband). Hard to say which one is best. You make the best choice for your company and you live with it.

There is talk about other vendors joining the SMB Direct market. But it seems to be taking a while. This is not that strange. I’ve understood that in the early days of this century iWarp got a pretty bad reputation due to the many issues around it. Apparently offloading the TCP/IP stack to the NIC, which is what iWarp does, is not an easy endeavor. Intel had some old Net card a couple of years ago but has gotten out of the game. Perhaps they’ll step back in but that might very well take a couple of years.

Other vendors like Broadcom, Emulex & QLogic might be working on solutions but I’m not holding my breath. Broadcom has DCB and has been hinting at RDMA in its NICs for many years but as of the writing of this post there is nothing functional out there yet. But bar the slowness (is complexity slowing the process?) it will be very interesting to see what they’ll choose: RoCE or iWarp. That choice might be the most public statement we’ll ever see about what technology seems like the best bet for these companies. But be careful, I have seen technology choices based on working/living with design choices at another level due to constraints in hardware & software that are no longer true today. So don’t just blindly do what others do.

Infiniband will remain a bit more of a niche I think and my guess is that RoCE is the big bet of Mellanox for the long term. 10Gbps and higher Ethernet switches are sold to everyone in the world. Infiniband, not so much. Does that make it a bad choice? Nope, it all depends. Just like FC is not a bad choice for everyone today, it depends.

Your options today

The options you have today to do SMB Direct are rather limited and bound to the different flavors and their vendor. Yes vendor not vendors.

  1. iWarp: Chelsio
  2. RoCE: Mellanox (v2 of RoCE has brought routability into the game, which counters one of iWarp’s biggest advantages next to operational ease. But the “no fuss about DCB” story might not be 100% correct. The question is whether this matters; after all, many people do well with iSCSI, which is easy but has performance limits.)
  3. Infiniband: Mellanox (QLogic was the only other remaining one, but Intel bought it from them. I have never ever seen Intel Infiniband in the wild.)

Note: You can do iWarp (and even RoCE in theory) without DCB but in all realistic high traffic situations you’ll want to implement PFC to keep the experience and results good under load. Especially the ports connecting to the SOFS nodes could otherwise potentially drop packets. iWarp, being TCP/IP, will handle dropped packets but possibly at the cost of deteriorated performance. With RoCE you’re basically toast if you lose packets, it should be lossless. I’m not too convinced that pure offloaded TCP/IP scales. Let’s face it, what was the big deal about lossless iSCSI => DCB. I would really love to see Demartek testing these things out for us.

If you have a smaller environment, no need for routing and minimal politics, I have seen companies select Infiniband, which per Gbps is very cheap. Lots of people have chosen iWarp due to its simplicity (which they heavily market) and routability. The popularity however has dropped due to price hikes that came with increased demand and no competition. RoCE is popular (I see it the most) and affordable but for this one you MUST do at least PFC. DCB support on switches is not an issue, even the budget friendly DELL PowerConnect N4000 series supports it, as did its predecessor the PC8100 series. Meaning if you have bought switches in the past 24 months and did your homework you’re good to go. Are routability and distance important? Well perhaps not that much today. But the trend in networking is heading for layer 3 down to the rack, which will be more acceptable when we see a lot of the workload goodness in hypervisors (Live Migration, vMotion, yes there is work being done on that) being lit up in layer 3, so it might become a key feature.

More Tips On Dealing With Removing Short File Names When Migrating To a SMB3 Transparent Failover File Server Cluster


You might have read my blog posts on the capabilities and the process of migrating to a Transparent Failover File Server. If not, here they are:

These are a good read with some advice from real world experience and in this post I’ll offer some more tips. I’ve discussed the need to disable and get rid of short file names in my blog and offered other tips to prepare for your migration and get your file share LUNs in tip top, modern shape. But what if you run into short file name issues where you can’t seem to get rid of them?

Well here are 4 more things to check:

1) Get rid of the shadow copies used for Previous Versions

The reason you’d better get rid of them is that they can also contain short file names & way too long paths or file names. We don’t want them to ruin the party so we remove them all by disabling shadow copies on the LUNs to be copied. We can enable them again once the LUN is up and running in the new file cluster.
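If you want to know up front where the way too long paths are on the live volume, a small scan helps. The following is only a minimal sketch and not part of the original procedure: the function name is made up, 260 is the classic Windows MAX_PATH limit, and the root path in the usage comment is a hypothetical placeholder. On Windows you may need long path support enabled or a `\\?\` prefixed root, otherwise the walk can silently skip the very folders you are looking for.

```python
import os

MAX_PATH = 260  # classic Windows limit, which includes the terminating null


def find_long_paths(root, limit=MAX_PATH):
    """Yield (length, full_path) for each file or folder under root whose
    full path is `limit` characters or longer."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if len(full) >= limit:
                yield len(full), full


# Usage (hypothetical root), longest offenders first:
# for length, path in sorted(find_long_paths(r"F:\Shares"), reverse=True):
#     print(length, path)
```

Fixing or archiving these before the copy saves you from finding them one failed file at a time in the migration logs.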

2) The logs indicate there are short file names you don’t have access to

If the NTFS permissions on the folder & file structure are OK you should not have too many problems, bar some files being locked by being in use. Rerunning the fsutil command prior to migrating, with the Server service stopped, will prevent any connectivity to and use of the file shares by people ignoring the request to log off or shut down their clients, or by automated jobs that otherwise keep accessing them.

But you might still get some indications in the log file(s) that state you can’t remove certain short file names.

image

There is the good old trick of running your command under SYSTEM. That does the job! It helps get rid of short file name instances of folders you normally don’t have access to. If SYSTEM has rights you’ll be fine, whether it’s a system folder or not. To do this the Sysinternals tools come in handy once again. You can launch a command prompt running under the NT AUTHORITY\SYSTEM account using psexec.exe by running the following from an elevated command prompt:

psexec -i -s cmd.exe or psexec  -s cmd.exe

image

The -s switch runs the process in the System account. PsExec temporarily installs a service (“psexec”, running psexesvc.exe) on the remote computer (or locally if that’s what you’re doing), which is removed when the app or process that’s running is closed. It’s obvious now, I hope, why you need an elevated command prompt to run this command.

Now should you do this by default? Nope. Just when you need to and as always have a realistic backup plan, a way to recover when things go south.

3) Antivirus sometimes prevents the removal of short file names

Disable antivirus, as it sometimes holds a temporary entry in the registry for the file involved. At least that’s what I’ve seen as a transient issue in some of the large number of logs I gathered. Yeah, I ran a lot of fsutil against large NTFS volumes. What can I say. Due diligence pays off!

4) Run ChkDsk

Just make sure the volume is healthy and no repairs are needed. If you’re migrating from an older file server there might be outstanding issues, and a check disk on volumes with lots of files takes time. Some of the ones I’ve dealt with had more than 2 million files on a 2TB LUN and it can take 24 hours. Fun when you have 10 LUNs :-/

Dilbert Life Series: The War For Talent


Disclaimer: The Dilbert® Life series is a string of posts on corporate culture from hell and dysfunctional organizations running wild. This can be quite shocking and sobering. A sense of humor will help when reading this. If you need to live in a sugar coated world where all is well and bliss and think all you do is close to godliness, stop reading right now and forget about the blog entries. It’s going to be dark. Pitch black at times actually, with a twist of humor, if you can laugh at yourself.

Attracting & retaining talent

If you listen to the talking heads in the media, recruiters & companies and read business related publications you’ll have noticed that when it comes to “Human Resources” there is supposedly a global war on. A war for talent. It’s not just attracting the best and brightest employees that is a concern, retaining them is an even bigger challenge it seems. When things are not to their liking they just pack up and fly off to the next awesome job opportunity, which are available in vast numbers and give freedom to excel whilst paying great salaries.

They are talking about somebody else

Keeping employees happy is supposed to be a major concern in “the talent wars”. All companies are in this war we’re told. Perhaps even if just for the fact that no company will admit they are not looking for great talented employees. All evidence to the contrary I might add as a lot of organizations do not act as if they are in a war for talent at all. Good jobs don’t seem to be available in any decent number either. It often looks more like they are in a race to the bottom.

Last year one of our major newspapers had front page news: “War for talent? Forget it, that doesn’t exist”. They point to high unemployment, low wage jobs, social dumping, demographics, immigration and discrimination based on age, sex, race, … In short, a slew of reasons to conclude the war for talent doesn’t exist. Basically it boils down to this: if companies are in a war for talent they can’t afford to lose, they can’t afford to act like this. Ergo, there is no war for talent.

I kind of disagree. There is most definitely a war for talent and there will always be one until computers & robotics outsmart us (dream on!). But let’s face reality, 95% of us are not considered talent at all, but a resource, so we’re not in that war. As a resource we’re as expendable as ammo in a war. As long as they can keep the supply line filled they’ll fire (pun intended) and waste those resources at will. Basically we’re lucky if we’re smart enough and young (cheap?) enough to be considered employable. Forget the lower 20% of our unskilled workforce, for them the deal is even rougher. And when you get fired at > 50, well good luck “grandpa”. All this while the talking heads blabber on about working beyond 67 …

You want proof? Look around you. Here’s where the war for talent is raging: A Google Programmer ‘Blew Off’ A $500,000 Salary At A Startup — Because He’s Already Making $3 Million Every Year. Well that isn’t me and probably not you either. Now don’t think everyone at Google is in that position, it’s a minority. => Techies CAN sue Google, Apple, Intel et al accused of wage-strangling pact. You see they want your talent, but not pay for it in free market.

Let’s look at some evidence that there might be no war for talent.

Toys & work force multipliers are not salary or a career

BYOD, a smartphone, tablet, laptop paid for by work. They bombard us with commercials about how we need to supply & support this if we want to stand a chance to even attract young talent. That’s only partially true. If I’m true top talent I’ll be able to afford those myself, thank you. I’d rather take a 6 figure salary, 30 days of paid vacation & affordable quality health care. After all you need to take good care of talent, right?

Performance Reviews

A golden oldie. When judging by the annual performance review practices out there, they are trying to make talent walk by proving to them the organization is too hopeless to even stop totally useless evaluation practices.

November 14, 1993

In corporate life your management often has no clue what you do. They often don’t even understand it. To add injury to insult you often have to write your review yourself.

January 06, 2003

Usually there’s only a stick

If you don’t have promotions, bonuses, rewards (not a merit badge, that’s just Neanderthal gamification done very, very wrong) or pay raises in place what’s with this war for talent anyway?

The fact that you can fire me if I’m not up to your standards? What kind of a messed up model is that? If we’re below standards you have a stick, I get that. If I meet, exceed or absolutely own those standards what exactly do you have to offer? Absolutely nothing?

March 10, 1995

Ouch! We cannot do anything for you, it’s out of our control, they’ll tell you. Could be, but I cannot get away with that answer when it comes to delivering results. Do you even offer a career path? Employees don’t get promoted and if they do, it’s without a pay raise. Pay raises themselves are dead except for the legal minimum.

The exit interview to improve retention

The exit interview is as useful as a post mortem in preventing death. It helps find out what went wrong after the fact, but it’s slightly less accurate than a real post mortem because in general the deceased don’t lie to you when you’re probing around and they always show up, albeit they have to be carried in. Just think the people who left you did so because, while you’re great & wonderful, they just didn’t fit in, and leave it at that. You’ll sleep better and waste less time.

You are creating your own hell

Most CxO types complain constantly about the lack of skilled employees that can think independently and have the ability to execute in order to achieve an end state. In reality that is their own fault. The system doesn’t work. They expect to buy and discard talent at will. Well, there isn’t enough talent to go around anymore because too many don’t really invest in developing it, for the sake of short term accounting benefits.

Talent needs time and opportunity to develop skills and expertise. No one wants to give that any more. So you’re creating your own shortage as it’s not magically going to start growing on trees. Secondly, when you have people who have the intrinsic motivation, drive and abilities to develop themselves into experts, you don’t reward them. Instead you demand ever more from them and pay them nothing more than anyone else, or even less, as you promote the bodies you can do without. We’re creating our own skills gap hell. But it’s easier to cry that you are a victim of a failing education system that doesn’t deliver experts that are experienced and cheap straight out of college.

Short term perceived gains for real long term damage & costs

Without the right people in the right place you no longer have analytical, design and architectural expertise. You have outsourced all that to vendors, “partners” and consultants. So now who can evaluate what is valid and valuable for you? No one. You’ll just get sold the flavor of the day that generates them the most profits. And of that doesn’t work there is always new stuff to sell you that will fix it. You fell for the trap of easy and cheap access to expertise meaning you lost all the expertise you had yourself. You are now dependent on mercenaries and their aim is to make money for themselves and survive even if it means killing you.  Every penny you spend wisely internally is an investment. Every penny you spend stupidly on a vendor is buying stuff that potentially makes you more dependent on them.

Companies are the ones to blame as they’re constantly in search of quick & dirty wins for short term (personal) gain. The “quick” part is forgotten as fast as the word itself implies, but the dirty part lingers and stinks up the place long after the fact.

War for talent? Think again.

So what exactly is the game plan here? Employees doing exactly enough not to get fired? Because by rules that ignore the above, everything we do beyond that level is a misallocation of our resources. That’s very, very Office Space like, dude.

image

image

In general it’s a race to the bottom leading to ever more mediocrity at ever higher costs and we all know who’ll get to pay the bill. Let’s hope some spin doctors can turn it into “good news”.

VEEAM Invests in Faster & More Efficient Data Protection With Backup & Replication 8


Ever more data to protect without breaking the systems or the bank

One of my major concerns in IT today, whether on premises or in the cloud, is the cost, time, reliability and feasibility of backups and restores. This is true for most of us. Due to the environments in which I deliver my services, my main issue with backups is the quantity of data. The amount of data is staggering and growth is not showing a downward trend.

The big four: CPU, Memory, Network & Storage

Over the years we have seen vast improvements in compute, memory, network and storage capabilities and pricing. CPUs are up to 18 cores per socket as I write this. DDR4 memory is here and the cost is relatively low. We have affordable 10Gbps networking to throw at the problem as well, or in some cases 8 to 16Gbps Fibre Channel. So when it comes to CPU, memory and network we’re pretty well served.

Storage is evolving as well and we’re getting ever bigger and, if you have the budget, faster storage arrays in different flavors. But it remains a challenge. First of all, getting the right amount of IOPS and storage capacity at an affordable price point is a balancing act. Secondly, when dealing with backups we need to manage the source IOPS & latency against the target. But that’s not all. While you might want to squeeze every last IOPS and millisecond of latency out of your backup target, you can’t carelessly do that to your source storage. If you do, this might amount to a Denial Of Service attack against your own applications and services. Even today storage QoS is either nonexistent, in its infancy or at best limited to particular workloads on particular storage solutions.

The force multiplier: Backup software capabilities & approaches

If you’ve made sure the above four resources are not your killer bottleneck, then the backup software, its methods and algorithms, and the approach used will be either your biggest problem or your best friend. You need your backup software to be:

  • Capable
  • Scalable
  • Fast
  • Configurable
  • Scale Out

There are some challenging environments out there. To deal with them, backup software should be able to leverage the wealth of capabilities that compute, network, memory & storage offer in order to protect large amounts of data reliably and fast. This should be done in a smart and operationally supportable manner. VEEAM has been working on this for a long time. They keep getting better at it with every release, and the product allows for scale out designs with regard to backup targets.

VEEAM Backup & Replication 8.0

There are many improvements in v8 but a couple stand out.

image

Consistency groups (Hyper-V)

Backup jobs can execute more than one VM backup task simultaneously from the same volume snapshot with “Allow Processing of Multiple VMs with a single volume snapshot”.

image

This means you can reduce the number of snapshots significantly, where in the past you needed a volume snapshot per VM. VEEAM limits the maximum number of VMs you can back up per snapshot to 4 when using software VSS and to 8 with hardware VSS. They do this because under heavy load VSS/CSV sometimes has issues. These limits can be tweaked to fit your needs (not all environments are created equal) with two registry values under the HKLM\SOFTWARE\Veeam\Veeam Backup and Replication key:

  • MaxVmCountOnHvSoftSnapshot (DWORD)
  • MaxVmCountOnHvHardSnapshot (DWORD)
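
One way to set both values is from an elevated command prompt with the Windows reg add tool. The following is only a minimal sketch that builds the command lines as strings so you can review them before running anything. The helper names are hypothetical, and the example values 4 and 8 are simply the limits mentioned above. Check the VEEAM documentation for your build before changing them on a production backup server.

```python
# Hypothetical helpers: they only build "reg add" command lines as text.
# Nothing is written to the registry by this script.
KEY = r"HKLM\SOFTWARE\Veeam\Veeam Backup and Replication"


def reg_add_command(value_name: str, count: int) -> str:
    """Return a 'reg add' command line that sets a DWORD value under KEY."""
    if count < 1:
        raise ValueError("VM count per snapshot must be at least 1")
    return f'reg add "{KEY}" /v {value_name} /t REG_DWORD /d {count} /f'


def snapshot_limit_commands(soft: int, hard: int) -> list[str]:
    """Build the commands for the software and hardware VSS limits."""
    return [
        reg_add_command("MaxVmCountOnHvSoftSnapshot", soft),
        reg_add_command("MaxVmCountOnHvHardSnapshot", hard),
    ]


if __name__ == "__main__":
    # Example values only: the limits named in the text above.
    for line in snapshot_limit_commands(soft=4, hard=8):
        print(line)
```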

Reducing the number of snapshots to be taken is good as it saves resources and speeds things up. And since VSS can be finicky, not taking more snapshots than absolutely necessary is a good thing.
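
To get a feel for the savings, here is a back of the envelope calculation. It is a sketch that assumes all VMs in the job live on the same volume and can always be grouped up to the limit, which real jobs will not always manage. The function name and the figure of 20 VMs are made up for illustration.

```python
def snapshots_needed(vm_count: int, max_vms_per_snapshot: int) -> int:
    """Volume snapshots needed when up to max_vms_per_snapshot VMs share one."""
    if vm_count < 0 or max_vms_per_snapshot < 1:
        raise ValueError("vm_count must be >= 0 and the limit must be >= 1")
    # Ceiling division without floating point.
    return -(-vm_count // max_vms_per_snapshot)


if __name__ == "__main__":
    vms = 20  # hypothetical number of VMs on one CSV
    print(snapshots_needed(vms, 1))  # one snapshot per VM: 20
    print(snapshots_needed(vms, 4))  # software VSS limit: 5
    print(snapshots_needed(vms, 8))  # hardware VSS limit: 3
```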

Backup I/O Control

Another improvement is Backup I/O Control, which delivers the capability to dynamically adjust the number of backup tasks based on storage I/O latency. Under Options you’ll find a new tab, I/O Control. It also contains the parallel processing option that used to be under the “Advanced” tab in Veeam B&R 7.

image

The idea is to move to a more “policy driven” approach for handling the load backups can put on the storage. Until now we’d configure a fixed number of tasks to run against the source storage in order to keep IOPS/latency in check. But this is very static. In a dynamic / elastic “cloud” world it isn’t very flexible, nor is it feasible to keep that number tuned to the best value for the current workload.

I/O Control lets you set limits on how much latency is acceptable for your data stores. Removing or adding VMs to the data store won’t invalidate your carefully set number of allowed tasks, as it’s now the latency that’s used to dynamically tune that number for you.

I/O control has two settings:

“Stop assigning new tasks to datastore at: X ms”: VEEAM looks at the I/O latency of the datastore before assigning a proxy (backup target) to a virtual disk and won’t launch the task until the load has dropped. This prevents the depletion of IOPS by launching too many backups.

“Throttle I/O of existing tasks at: Y ms”: This throttles the I/O of already running backup jobs when application workloads in the VMs running on the source storage kick in. The throttled backups will take longer, but they won’t kill the performance of the applications while they are running.

These two settings allow for dynamic, on the fly tweaking of the number of backup tasks running as well as their impact on storage performance. Once you have determined what latency values are acceptable to you, you’re done. VEEAM handles the tweaking for you. The default values seem to reflect industry best practices (sustained latency > 20 ms is considered problematic).
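
The way the two thresholds interact can be sketched in a few lines. This is purely an illustration of the idea and not VEEAM’s implementation. The class and function names are hypothetical, and the 20 ms and 30 ms values are assumed example thresholds. How latency is actually sampled and averaged is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyPolicy:
    """Two latency thresholds in milliseconds (assumed example values)."""
    stop_new_tasks_ms: float = 20.0  # "Stop assigning new tasks to datastore at"
    throttle_ms: float = 30.0        # "Throttle I/O of existing tasks at"


def may_assign_new_task(policy: LatencyPolicy, latency_ms: float) -> bool:
    """New tasks are only started while latency is below the first threshold."""
    return latency_ms < policy.stop_new_tasks_ms


def should_throttle(policy: LatencyPolicy, latency_ms: float) -> bool:
    """Running tasks are slowed down once latency reaches the second threshold."""
    return latency_ms >= policy.throttle_ms


if __name__ == "__main__":
    policy = LatencyPolicy()
    for observed in (12.0, 25.0, 35.0):  # made-up latency samples in ms
        print(observed, may_assign_new_task(policy, observed),
              should_throttle(policy, observed))
```

With these example values, a datastore at 25 ms gets no new tasks but running tasks continue at full speed. Only at 30 ms or more are the running tasks slowed down.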

The screenshot below is from the backup job log and shows latency being monitored.

image

With VEEAM B&R v8 Enterprise Plus you can even do this per data store, meaning you can optimize this per backup source. This recognizes that there is no “one size fits all” and allows for differentiation. Yet it does so in a way that does not compromise on the simplicity of use that VEEAM offers. This sounds easy but from experience I know it isn’t. VEEAM manages to offer a great balance between simplicity and functionality for companies of all sizes.

Select “Configure”

image

In the “Datastore Latency Settings” you can add one, several or all of the data stores you are protecting with VEEAM. This allows for differentiation when you have CSVs that are used for SQL Server VMs versus stateless web servers or other workloads that are not storage I/O intensive.

image

Select the datastore (in our case the CSVs in the Hyper-V cluster)

image

By selecting the desired datastore and clicking “Edit” you can individually adjust the settings for that datastore.

image
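
Conceptually, the per datastore settings behave like a table of overrides on top of the global thresholds. The sketch below shows that idea only. The datastore names, the threshold values and the function name are all made up for illustration and do not come from VEEAM.

```python
# (stop assigning new tasks at, throttle existing tasks at), both in ms.
# Assumed global example values.
GLOBAL_THRESHOLDS_MS = (20, 30)

# Hypothetical overrides: stricter for a CSV hosting SQL Server VMs,
# more relaxed for a CSV hosting stateless web servers.
DATASTORE_OVERRIDES_MS = {
    "CSV-SQL-01": (10, 15),
    "CSV-WEB-01": (30, 40),
}


def thresholds_for(datastore: str) -> tuple[int, int]:
    """Return the override for a datastore, or the global values if none is set."""
    return DATASTORE_OVERRIDES_MS.get(datastore, GLOBAL_THRESHOLDS_MS)


if __name__ == "__main__":
    for name in ("CSV-SQL-01", "CSV-WEB-01", "CSV-GENERAL-01"):
        print(name, thresholds_for(name))
```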

Conclusion

It looks like we have some great additional capabilities in an already very good solution. I’ll be using these new capabilities in real life scenarios to see how they work out for us and to optimize the backups of the virtualized environment under my care. Hardware VSS providers, SANs and CSVs normally need some tweaking and care to make them run well, so that’s what we’ll be doing.