[HamWAN PSDR] Service Impact Notice

Fri Mar 11 10:25:43 PST 2016

I'll echo the time constraints. We're looking at core infrastructure
deployment for Georgia, USA, and have a lot of general interest in the
project.

We're experiencing similar volunteer constraints and have yet to begin full
operations. I can only imagine how much physical network operations are going
to suffer once those deployments start.

Regards,

Sam Kuonen, KK4UVL

On Fri, Mar 11, 2016, 12:29 PM Nigel Vander Houwen <nigel at nigelvh.com>
wrote:

> Bart, Rob,
>
> The biggest problem I see here is time resources. I brought this up to
> Bart off list, but there’s a continuing struggle to either have time to do
> the work yourself, or get other people to do the work.
>
> I deployed all of our monitoring and logging infrastructure, and I can say
> for a fact it’s been a struggle to get anyone to even do the basic work of
> adding new devices to the existing monitoring system, even after providing
> tutorials. This has gotten a bit better in very recent history, but it
> remains an issue.
>
> Automation is absolutely something we need to put more work into. Ryan and
> I have already put a bunch of work into this, which again, we have
> struggled to get folks to pick up, use, and contribute to.
>
> Modems breaking happens, and site access can be a significant problem. The
> East Tiger-SnoDEM link that Bart called out has been known to be down, but we
> can’t feasibly get that replaced in the middle of winter. Hopefully soon
> that can be taken care of.
>
> We can try to treat this like a production network all we want, but the
> reality is that we have effectively one part-time staffer trying to do, as
> Rob put it, both the Operations and Development work.
>
> The reality is that this is a network with VERY limited admin resources,
> which get split up to do various important things, the 900MHz work
> included, but that leaves even less available to do any day to day work.
> This isn’t our full time job, we’re not paid, we all have lives and
> families, we have VERY few people that actually volunteer to do any of the
> work, so the reality is there’s a lot we have a hard time getting to.
> Reality puts us much closer to “best effort” than “production”, and until
> we get more time/resources to do the work, it’s going to continue to be a
> struggle.
>
> If folks want to volunteer, I’d be happy to put them on improvements in
> monitoring, automation, and fixing things in the existing production
> network.
>
> Nigel
>
> On Mar 11, 2016, at 09:11, Rob Salsgiver <rob at nr3o.com> wrote:
>
> Bart,
>
> You touch on a few things that have been “niggling” at the back of my mind
> for quite a while now – most of them come down in one way or another to
> overall reliability (of HamWAN) for EMCOMM, which most know has been my
> main driver for supporting the effort.
>
> There’s been a TON of great work done and quite frankly, I’ve been amazed
> that HamWAN has gone as far and fast as it has, particularly for a “ham”
> effort.
>
> At the same time we’ve slowly been adding and attracting the attention of
> various EMCOMM organizations with the promise and potential of redundant,
> reliable, resilient communications when “the big one” hits.  Obviously not
> everything in HamWAN is expected to survive a major quake or other event, but
> even pockets of reliable, high-speed communication are more than what can
> be accomplished via voice relays.
>
> All of which brings us back to the current outage and discussion.  There have
> been several outages in key places since we began.  Last year SnoDEM was
> all but stranded due to a Haystack modem failure and other events at the
> same time.  Now we have a similar situation in a different place brought on
> by multiple failures or weaknesses.  In other instances I’ve been told
> we’ve had outages via misconfigured devices or other reasons.  Even in a
> perfect world, human error happens.
>
> I believe HamWAN would benefit from somewhat of a shift in operating
> philosophy that would create two separate departments or divisions –
> operations and development.
>
> Operations responsibilities
> 1)       Provide day to day monitoring of network resources and conditions
> 2)       Manage (admin) those portions of the network that are
> designated as “in production”.  This should be the majority of the network.
> 3)       Provide communications and coordination of network maintenance
> 4)       Maintain an active inventory of all operational (production)
> sites, site hardware, and site access information.
> 5)       Maintain and manage all production site device configurations
> and config change management.
> 6)       Coordinate implementation of new functionality introduced by the
> Development department with appropriate monitoring, end-user communication,
> etc.
> 7)       Recommend topics and technologies to be explored by the
> Development team to enhance operational stability and delivery of new
> features to the network.
> 8)       Document technologies, methods, and tools selected for use (and
> why) from an operational standpoint.
> 9)       Maintain an active inventory of spare hardware to support all
> sites.
> 10)   Establish a plan to correct ALL key site failures within XXXX days.
> 11)   Coordinate with Development to actively inject and test network
> failures and redundancy capabilities.
> 12)   Coordinate with Development to enhance HamWAN’s ability to operate
> in “pockets” when portions of the network fail in an earthquake – i.e. –
> each “island” stays operational with as many services as possible
>
> Development responsibilities
> 1)       Continued exploration of new hardware, software, and network
> management tools (Quagga vs BIRD, Metals vs QRTs, etc)
> 2)       Conduct experimentation with new hardware and software on
> separate network resources where possible, or in coordination with
> Operations on the larger network (more on this below).
> 3)       Document technologies, methods, and tools explored and indicate
> pros/cons of each where possible.
> 4)       Continued exploration, analysis, and documentation of available
> antenna and shielding designs
> 5)       Exploration of new antenna designs and/or other hardware?
> 6)       Exploration of new frequencies and how they are affected by
> terrain, vegetation, weather, etc.
> 7)       This particular list can go on FOREVER
>
> The distinction here is largely mental, but it’s important.  It is
> entirely possible to have the same people in both groups, yet having the
> separation is important if HamWAN wishes to be taken seriously as a
> services provider to the EMCOMM community.  Any benefits from that would
> also improve service for ALL HamWAN users.
>
> Having EMCOMM onboard is important.  Not only does it provide a needed
> service to them, but if critical mass can be achieved it gives HamWAN
> access to multiple sites in every city and county.  In turn though, HamWAN
> as a network needs to be reliable in the “customer’s” eyes.  This means
> that infrastructure is managed with uptime as the highest priority,
> experimentation is managed to minimize adverse production impacts, and
> equipment failures are identified and corrected quickly.
>
> This is admittedly a fair amount of work.  Much of it I suspect is already
> underway – maybe just not quite in this format.  Additional help will
> definitely be useful.  Everyone involved only has so much time available,
> and they should be able to focus on those items that are important to
> them.  I believe the above framework (or something similar) begins to put
> some useful structure in place that continues to shape HamWAN from being
> the “wild west” of amateur and network “geek” exploration into the
> reliable, commercial grade, disaster resistant, amateur platform it
> aspires to be - while still allowing amateurs to push the limits of
> technology like they are meant to.
>
> If the above (or something similar) is of interest to the current
> directors and group as a whole, we can easily create a similar worklist
> from which individuals on the sidelines can start picking things they can
> help with.
>
> Just ideas.  Not saying they’re perfect, but it’s a start.  Any other
> thoughts?
>
> Cheers,
> Rob Salsgiver – NR3O
>
> *From:* PSDR [mailto:psdr-bounces at hamwan.org] *On Behalf Of* Bart Kus
> *Sent:* Friday, March 11, 2016 12:56 AM
> *To:* psdr at hamwan.org
> *Subject:* Re: [HamWAN PSDR] Service Impact Notice
>
>
> Hmm that's not the whole story though.  If it were just the 1 router
> failure (in reality a hypervisor failure), we'd be in a much better
> position, but it's combined with 2 other modem failures.  We had the
> ETiger->SnoDEM modem die over the winter, and it needs replacement.  That
> link has been down for a month or more now.  And most recently we're having
> the Tukwila->Baldi modem lose connectivity frequently.  We've implemented
> an automatic mitigation for that, but it still produces sporadic short
> downtime windows of a few minutes.  I'd just like to move that modem to a
> NetMetal 5.  Our servers are also being affected by instability in the
> Quagga routing software.  We need to replace this with a more stable
> alternative, like BIRD.  Lastly, the Baldi emergency uplink is only
> configured to go to Westin and Corvallis, but not Tukwila.
>
> We could have avoided DNS outages too, if the anycast groups were
> populated with more of the available servers.  I believe lack of good
> automation for server build-outs is causing the deployment lag here.
>
> The network is designed to withstand failures, even multiple failures, but
> we've got many broken things right now that need fixing.  After that
> fixing, I would really love to see some folks get behind improving our
> monitoring, deployment and diagnostic automation.  Networks like this won't
> scale unless they're nearly completely automated and simple to manage.  I
> would not mind at all if we even rolled back some features until we can get
> them re-implemented in 100% automated ways.
>
> As important as all this is, I still think the deep penetration project
> takes precedence, so I can't drop that work in favor of this.  Aside from
> helping out on the simple break-fix stuff, I mean.
>
> --Bart
>
> On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:
>
> Thanks for the update, Nigel.
>
> On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen <nigel at nigelvh.com>
> wrote:
>
> Hello All,
>
> Just wanted to send out a quick notice here. We’ve had a failure at our
> Seattle edge router, which we’re still investigating. In the meantime, our
> Tukwila edge router is still providing connectivity, but you may notice
> higher latencies or issues reaching things. If you find things you can’t
> reach, please let me know, as we’d like to make sure the redundancy is
> working, while we’re working to resolve the issues we’re investigating with
> the Seattle edge router.
>
> Nigel
> _______________________________________________
> PSDR mailing list
> PSDR at hamwan.org
> http://mail.hamwan.net/mailman/listinfo/psdr
>
> --
>
> Ryan Turner