[HamWAN PSDR] Service Impact Notice

Fri Mar 11 16:23:33 PST 2016

Thanks Ed/Sam/Nigel/Bart/all for the thoughts.

First and foremost I want to make sure that I am in NO way criticizing the huge amount of effort that has gotten us this far, OR  the efforts of ANY individual who put their time and energy into the project.  As a past casualty of throwing myself into “the cause” I know all too well the toll it can take, and I sincerely appreciate the time everyone has invested.

It is also precisely for that reason that I brought the topic up.  Like many other well-intentioned projects, HamWAN has reached a threshold that it must either pass in order to continue to grow, or stay below and be content with what it has achieved.  Without some form of change, HamWAN (and HamWANs throughout) faces the same risk as a lot of other amateur-related efforts: namely, to “die on the vine” due to lack of support and eventual “creator” exhaustion.  I would also hate to see all of this effort, goodwill, and potential be wasted if HamWAN were to fold or otherwise fail to reach its potential.  The vision is valid, and the potential in the 2nd and 3rd generations of amateur WAN connectivity and communications is truly enormous.  It is only fair that we continue to mold the idea of HamWAN as necessary to see that the work that has been done is not in vain, and that the efforts of the creators are rewarded and built upon in the future.  /sap /sermon

I submit that the separation of Operations and Development is not as much a staffing need as it is a MENTAL need – although additional people devoted to both areas would obviously be beneficial.  

In the beginning EMCOMM was identified as an amateur-related activity that could really benefit from what HamWAN could bring to the table.  Conversely, EMCOMM also has the potential to provide HamWANs with access to high value sites to deploy and establish coverage.  These needs are highly symbiotic and also carry responsibilities.  There have been numerous amateur efforts to implement digital communications that have been successful in pockets yet failures overall.  D-STAR, packet, you name it.  Most of it boils down to complexity for the end-point operator.  How many hams know (or care to know) how to tweak a TNC, only to have RF conditions, radio parameters, TNC parameters, or the operator on the other end keep a message from getting through?  Over the years a lot of time and expense has been put into amateur “solutions” by Served Agencies, only to have them lie unused today.  HamWAN shares some of those risks, but not all.  The use of current (and cutting edge) networking technology eliminates one of the pitfalls of packet, namely obsolescence.  Yes, packet still has its place, but that place is not high speed, high volume data communications.  In order to be accepted in EOCs and other EMCOMM sites, HamWAN must “mentally” be better than the alternatives.  It has to be reliable, robust, and treated like a serious disaster resource platform.  While one of the attractions of HamWAN is to provide an environment for amateur experimentation, it CANNOT be done on parts of a network that we are asking EMCOMM entities to help fund, provide sites for, and operate on behalf of.  If EOCs are providing cell sites that are down 10, 20, or 50% of the time due to network “experimentation”, misconfigurations, or hardware failures that we don’t have the manpower to support, there is little incentive to spend public money on yet another amateur “solution” that will not work when it’s actually needed.
Conversely, we need to “build” an environment that lends itself to as little human “babysitting” required as possible, while allowing the necessary experimentation to continue advancing the art.

As mentioned in a couple of emails, expectation management is key.  This applies both ways.  We are asking to be invited, but also need to evolve to where we (HamWAN) can be counted on.  We can go the path of setting up parallel but separate networks for EMCOMM and HamWAN as an experimentation platform.  Unfortunately we either double the hardware at each site to support both (and the support needs), or we lose potential sites due to lack of supporting entity(ies).  I argue that having EMCOMM supported as a key component of ALL HamWANs in a single network is the most beneficial to both communities.  EMCOMM gets a disaster communications solution that doesn’t exist elsewhere at a cost that doesn’t sink taxpayer-funded budgets, and for that “price” amateur-based HamWANs get access to premium sites that would not otherwise be available.  If this model still makes sense now that HamWAN is in year 3 or 4, then it is merely a question of how best to set things up to continue moving forward.

The points of separating Operations and Development are more mental and organizational in nature.  By creating the distinction, you accomplish the following:

1)       Offloading (with documentation and guidance) the daily maintenance tasks to others in manageable bites – even more so with greater automation

2)       Establish that network stability and reliable operation take priority over all else.

3)       Development and change management happen under a framework within a group / email list.  i.e. – avoid the situations where changes are implemented partway due to the only guy working on it having 20 minutes after work before going to bed and then something elsewhere goes down via unintended consequences.  It’s not a slam on anyone, it’s basic change management that we all use at work.

4)       Involve more people by creating roles that are less technically encompassing in scope – i.e. – not everyone needs to be a telecom, Microsoft, or Amazon network engineer to be able to contribute.

5)       And this list can go on…

To Bart’s most recent points I have no disagreement that as much as possible can (and should) be automated.  At the end of the day though, somebody (or a few somebodies) has to call the shots on when and how network changes, maintenance, and support happen.  This is more from a standpoint of making fundamental changes to the infrastructure design.  Easy examples would be changes in routing protocols, global router settings changes, manual changes to accommodate temporary network conditions or outages that may unintentionally conflict with automated features, etc.

I didn’t get a copy of Ed’s email that discussed Phase 1 and Phase 2 issues, so I’ll have to pass on that.  Initially I didn’t get Nigel’s response either until I got it in one of Bart’s replies.   Not sure what’s going on there.

Where do we (HamWAN) have the current list of topics that are under development and/or need help?  Probably 50% of the time I have IRC running in the background at work, and most of the conversations I see with regard to a specific topic are among people already engaged in that particular task, but there is never a running list of what’s being worked on or what could use help.  While I haven’t been through the reworked website with a fine-tooth comb, I don’t remember seeing those topics there either.  The flip side is I also understand that the few who have been working on them typically don’t want to take the time to keep publishing lists either, so it’s somewhat of a catch-22.

In my personal case, in the dinosaur age I was a pretty skilled admin and could hold my own in most areas.  I’ve been away quite a while in management roles and would have some catching up to do for sure, but probably could be of use somewhere.  Once upon a time I knew my way around Cisco IOS; today I’m re-learning MikroTik <g>.  Others on the list I’m sure have other skills that might be able to help, but where do they start so that they can be useful?

Lastly the topic of writing software.  I understand the advantages of having something that does exactly what you want and it’s also a part of experimentation and advancing the art.  At the same time, anything that is custom either needs the creator to be immortal to continue support into the future, or it needs to be documented to the level of being able to train your replacement.  Google works wonders for looking up support information on off the shelf products.  Not as well for custom software.  I’m not saying it can’t be used, only that there are definite trade-offs.

Ok.  I’ll stop for now.  Thanks for taking the time to read & respond.  

Cheers,

Rob

From: PSDR [mailto:psdr-bounces at hamwan.org] On Behalf Of Bart Kus
Sent: Friday, March 11, 2016 1:12 PM
To: psdr at hamwan.org
Subject: Re: [HamWAN PSDR] Service Impact Notice

Replying to the latest fully-quoted message instead of Ed's, but Ed, your observations are spot on.

Rob, I think the concept of network ops is finished both in the industry and for HamWAN.  In the industry, we're working at such enormous scales that you cannot possibly staff enough people to do any of the ops tasks manually.  Even if you did, the unavoidable human failure rate would cripple your resulting system.  In HamWAN, we have the same problems as industry (albeit at a microscopic scale), but additionally requiring staff to operate things is an adoption hurdle.  We don't have the incentive of wages to staff these required job functions.  Combine that with a general lack of computer/network knowledge in the ham community and you're doomed, even if you did manage to gather enough well-meaning people to support you.

This problem isn't unique to the Puget Sound Data Ring.  Everyone else trying to implement a HamWAN will face the same challenges, as Ed correctly points out.  We need to make the leap from phase 1 to phase 2 (see Ed's email), because we've been successful enough (yay!) to grow to such a scale that we're starting to fail at phase 1.

HamWAN has so far delivered interfacing standards, and a bunch of docs that educate people on suggestions (not standards) for how to configure the non-standardized parts of your network.  That's a good starting point, but now that we know our standard ideas work reasonably well, it's time to take on the additional task of making them self-implementing in new HamWAN instances.  This means a lot of software development.

And therein lies the problem.  In this project we have maybe 2 people who can help write the software required.  For us to successfully make the leap from phase 1 to phase 2, we've got to become attractive to people who write software.  A team of 6-10 folks would give us a good chance at making the leap.

I'm not sure how to do recruiting for this, but don't let that be the seminal question of this email.  I'd like to hear from people if they agree with the direction shift I've proposed here.

--Bart

On 3/11/2016 10:25 AM, Sam Kuonen wrote:

I'll echo the time constraints. We're looking at core infrastructure deployment for Georgia, USA and have a lot of generalized interest in the project.

We're experiencing similar volunteer constraints and have yet to begin full operations. I can only picture how physical network operations are going to proceed and suffer once those deployments start. 

Regards,

Sam Kuonen, KK4UVL

On Fri, Mar 11, 2016, 12:29 PM Nigel Vander Houwen <nigel at nigelvh.com> wrote:

Bart, Rob,

The biggest problem I see here is time resources. I brought this up to Bart off list, but there’s a continuing struggle to either have time to do the work yourself, or get other people to do the work.

I deployed all of our monitoring and logging infrastructure, and I can say as a fact it’s been a struggle to get anyone to even do the basic work of adding new devices to the existing monitoring system, even after providing tutorials. This has gotten a bit better in very recent history, but it remains an issue.

Automation is absolutely something we need to put more work into. Ryan and I have already put a bunch of work into this, which again, we have struggled to get folks to pick up, use, and contribute to.

Modems breaking happens, and site access can be a significant problem. The East Tiger-SnoDEM link that Bart called out has been known down, but we can’t feasibly get that replaced in the middle of winter. Hopefully soon that can be taken care of.

We can try to treat this like a production network all we want, but the reality is that we have effectively one part time staff trying to do, as Rob put it, both the Operations and Development work.

The reality is that this is a network with VERY limited admin resources, which get split up to do various important things, the 900MHz work included, but that leaves even less available to do any day to day work. This isn’t our full time job, we’re not paid, we all have lives and families, we have VERY few people that actually volunteer to do any of the work, so the reality is there’s a lot we have a hard time getting to. Reality puts us much closer to “best effort” than “production”, and until we get more time/resources to do the work, it’s going to continue to be a struggle.

If folks want to volunteer, I’d be happy to put them on improvements in monitoring, automation, and fixing things in the existing production network. 

Nigel

On Mar 11, 2016, at 09:11, Rob Salsgiver <rob at nr3o.com> wrote:

Bart,

You touch on a few things that have been “niggling” at the back of my mind for quite a while now – most of them come down in one way or another to overall reliability (of HamWAN) for EMCOMM, which most know has been my main driver for supporting the effort.  

There’s been a TON of great work done and quite frankly, I’ve been amazed that HamWAN has gone as far and fast as it has, particularly for a “ham” effort.  

At the same time we’ve slowly been adding and attracting the attention of various EMCOMM organizations with the promise and potential of redundant, reliable, resilient communications when “the big one” hits.  Obviously not everything in HamWAN is expected to survive a major quake or other event, but even pockets of reliable, high-speed communication are more than what can be accomplished via voice relays.

All of which brings us back to the current outage and discussion.  There have been several outages in key places since we began.  Last year SnoDEM was all but stranded due to a Haystack modem failure and other events at the same time.  Now we have a similar situation in a different place brought on by multiple failures or weaknesses.  In other instances I’ve been told we’ve had outages via misconfigured devices or other reasons.  Even in a perfect world, human error happens.

I believe HamWAN would benefit from somewhat of a shift in operating philosophy that would create two separate departments or divisions – operations and development.  

Operations responsibilities

1)       Provide day to day monitoring of network resources and conditions

2)       Manage (administer) those portions of the network that are designated as “in production”.  This should be the majority of the network.

3)       Provide communications and coordination of network maintenance

4)       Maintain an active inventory of all operational (production) sites, site hardware, and site access information.

5)       Maintain and manage all production site device configurations and config change management.

6)       Coordinate implementation of new functionality introduced by the Development department with appropriate monitoring, end-user communication, etc

7)       Recommend topics and technologies to be explored by the Development team to enhance operational stability and delivery of new features to the network.

8)       Document technologies, methods, and tools selected for use (and why) from an operational standpoint.

9)       Maintain an active inventory of spare hardware to support all sites.

10)   Establish a plan to correct ALL key site failures within XXXX days.

11)   Coordinate with Development to actively inject and test network failures and redundancy capabilities.

12)   Coordinate with Development to enhance HamWAN’s ability to operate in “pockets” when portions of the network fail in an earthquake – i.e. – each “island” stays operational with as many services as possible

Development responsibilities

1)       Continued exploration of new hardware, software, and network management tools (Quagga vs BIRD, Metals vs QRTs, etc)

2)       Conduct experimentation with new hardware and software on separate network resources where possible, or in coordination with Operations on the larger network (more on this below).

3)       Document technologies, methods, and tools explored and indicate pros/cons of each where possible.

4)       Continued exploration, analysis, and documentation of available antenna and shielding designs

5)       Exploration of new antenna designs and/or other hardware?

6)       Exploration of new frequencies and how they are affected by terrain, vegetation, weather, etc

7)       This particular list can go on FOREVER

The distinction here is largely mental, but it’s important.  It is entirely probable that the same people will be in both groups, yet having the separation is important if HamWAN wishes to be taken seriously as a services provider to the EMCOMM community.  Any benefits from that would also improve service for ALL HamWAN users.

Having EMCOMM onboard is important.  Not only does it provide a needed service to them, but if critical mass can be achieved it gives HamWAN access to multiple sites in every city and county.  In turn though, HamWAN as a network needs to be reliable in the “customer’s” eyes.  This means that infrastructure is managed with uptime as the highest priority, experimentation is managed to minimize adverse production impacts, and equipment failures are identified and corrected quickly.

This is admittedly a fair amount of work.  Much of it I suspect is already underway, maybe just not quite in this format.  Additional help will definitely be useful.  Everyone involved only has so much time available, and they should be able to focus on those items that are important to them.  I believe the above framework (or something similar) begins to put some useful structure in place that continues to shape HamWAN from being the “wild west” of amateur and network “geek” exploration into the reliable, commercial grade, disaster resistant, amateur platform it aims to be, while still allowing amateurs to push the limits of technology like they are meant to.

If the above (or something similar) is of interest to the current directors and the group as a whole, we can easily create a similar worklist from which individuals on the sidelines can start picking things they can help bring about.

Just ideas.  Not saying they’re perfect, but it’s a start.  Any other thoughts?

Cheers,

Rob Salsgiver – NR3O

From: PSDR [mailto:psdr-bounces at hamwan.org] On Behalf Of Bart Kus
Sent: Friday, March 11, 2016 12:56 AM
To: psdr at hamwan.org
Subject: Re: [HamWAN PSDR] Service Impact Notice

Hmm that's not the whole story though.  If it were just the 1 router failure (in reality a hypervisor failure), we'd be in a much better position, but it's combined with 2 other modem failures.  We had the ETiger->SnoDEM modem die over the winter, and it needs replacement.  That link has been down for a month or more now.  And most recently we're having the Tukwila->Baldi modem lose connectivity frequently.  We've implemented an automatic mitigation for that, but it still produces sporadic short downtime windows of a few minutes.  I'd just like to move that modem to a NetMetal 5.  Our servers are also being affected by instability in the Quagga routing software.  We need to replace this with a more stable alternative, like BIRD.  Lastly, the Baldi emergency uplink is only configured to go to Westin and Corvallis, but not Tukwila.

We could have avoided DNS outages too, if the anycast groups were populated with more of the available servers.  I believe lack of good automation for server build-outs is causing the deployment lag here.

The network is designed to withstand failures, even multiple failures, but we've got many broken things right now that need fixing.  After that fixing, I would really love to see some folks get behind improving our monitoring, deployment and diagnostic automation.  Networks like this won't scale unless they're nearly completely automated and simple to manage.  I would not mind at all if we even rolled back some features until we can get them re-implemented in 100% automated ways.

As important as all this is, I still think the deep penetration project takes precedence, so I can't drop that work in favor of this.  Aside from helping out on the simple break-fix stuff, I mean.

--Bart

On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:

Thanks for the update, Nigel.

On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen <nigel at nigelvh.com> wrote:

Hello All,

Just wanted to send out a quick notice here. We’ve had a failure at our Seattle edge router, which we’re still investigating. In the meantime, our Tukwila edge router is still providing connectivity, but you may notice higher latencies or issues reaching things. If you find things you can’t reach, please let me know, as we’d like to make sure the redundancy is working, while we’re working to resolve the issues we’re investigating with the Seattle edge router.

Nigel
_______________________________________________
PSDR mailing list
PSDR at hamwan.org
http://mail.hamwan.net/mailman/listinfo/psdr

-- 

Ryan Turner

