The primary purpose of this page is to informally document the changes (hopefully improvements) with NetIn.com. It is a place where our customers (and potential customers) can track what has been done, and where NetIn.com is going. Again, it is very informal, and any comments or suggestions you have are welcome.
The main web server crashed. I suspect that the server got too cold, as I was drawing in air from the outside and it was near freezing last night. To make matters worse, the min/max recording thermostat is faulty, so I can't confirm it.
Our main web server died and was restarted at 11:45am (which is when I noticed it - hey, it was a late quakeworld night). I do not yet know the cause. It had an uptime of 65 days before this event.
The Verio network (our upstream provider) went dead. I'm still trying to find out what happened.
Moved WAN at the Irving POP back to the T-1 span due to instabilities in the ISDN routing equipment. The manufacturer was called and is involved in solving the problem.
Basically, the 10/28/97 problem happened again: the backup lines were established, but did not route. This caused an overnight outage at the Irving POP, which affects only dialup customers.
Disaster struck. Before I got the backup lines in place (oh - the new 10Mbps connection means new equipment, new lines, new configurations, and of course new problems!) the new router went down. So far, I'm not sure of the exact cause of this, but I thought I'd post something so you would know what happened.
I've been real busy lately (and grumpy!) configuring our 10Mbps connection to the Internet, and a second POP site (one in GTE land, one in SWB). The systems seem to be pretty stable now (had a little routing problem earlier in the day). There will be some account changes, but I will let you know if you're affected. And, of course, all the monitoring software now needs to be reconfigured, along with web pages that need to be updated. A man's work is never done...
I replaced the old "get-router-stats" program with the mrtg program. All the history was lost, but I like the new format a lot better. In our typical open kimono style, you can view our stats.
Added an additional 64MB of RAM to fangorn - so fangorn was rebooted. This was needed due to the size of the data files for usage reports.
The primary Internet connection went down for about 5 minutes. I established the secondary connection, but the primary was restored a couple of minutes later. We suffered about a minute of actual outage while detecting the problem and rolling over to the backup.
Noticed and fixed a bug in the usage report where requesting information for September gave data for August. This is now corrected.
Moved the new node, fangorn, from test into production. Fangorn is a dual Pentium Pro system sitting on a 100BaseTX segment, and uses 10,000rpm Ultra SCSI fast and wide Cheetah drives.
Verio was performing some upgrades and things went bad. We lost connections to all the MAEs at one time or another, but problems seemed to center around MAE-East. Things were stable again by 9:45pm.
Discovered that bagend's clock was drifting. This is important as bagend tracks things like WAN usage. Anyway, I fixed it - but the WAN usage reports probably are not accurate for days before 8/6/97 (they show usage being too high, as the clock was running slowly).
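For the curious, here is a minimal sketch of why a slow clock inflates a usage figure. It assumes (and this is only an assumption, not a description of the actual reporting software) that usage is computed as bytes moved divided by the elapsed time the host measured. The function name and the numbers are made up for illustration.

```python
def reported_rate(bytes_transferred, true_seconds, clock_speed):
    """Bytes per second as computed by a host with a drifting clock.

    clock_speed is how many seconds the host's clock advances per real
    second (1.0 is a perfect clock, below 1.0 is a slow clock).
    All of these names are hypothetical.
    """
    # A slow clock measures less elapsed time than really passed.
    measured_seconds = true_seconds * clock_speed
    return bytes_transferred / measured_seconds


# Same traffic, same real interval, two different clocks.
good = reported_rate(1_000_000, 100.0, 1.0)
slow = reported_rate(1_000_000, 100.0, 0.8)

# The slow clock divides by a smaller interval, so usage looks too high.
print(good, slow)
```

The same traffic over the same real interval comes out higher on the slow clock, which matches the direction of the error in the reports.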
Moved ethernet based equipment onto our new switched 100Mbps backbone and to our new 10Mbps switch. Most likely, you will not notice a huge difference, as the previous LAN equipment was kick'n butt. This just makes sure we have the infrastructure in place to support growth.
Re-established primary WAN connections. There was a combination of problems, starting at GTE's Cascade switch, then a DCX, and finally my DSU. The problem ticket is closed, but I'm going to watch this circuit for a while.
Lost protocol on primary WAN circuit. Rolled over to secondary circuit.
Moved equipment around. Two routers were rebooted (not at the same time! ;^) when moving them to new UPS's. There was a small (less than 15 min at around 2am) outage while I rerouted some lines.
A beta version of a web based application to check your usage is now available. This report is password protected, and is up to date as of the last time you logged out (i.e., it does not contain information about your current session). I've also created a page so that you can change your password. Please let me know if you have any problems with these programs.
I've also put up a page graphing our main WAN connection usage - for those who are interested in such things. It's updated several times a day, so remember to press reload...
Well - it was one of those days. It started with MAE-West's Giga-Switch going down (that means losing a T3 for us) for several hours. We still have 2 other T3s going to different NAPs, which should have handled the load, but my provider took the opportunity to upgrade/reconfigure some ports, and routing got screwed up. I still don't have the full story... But MAE-West is back and so is normal routing/operations.
It's been a week, and nobody has complained of a "cause code 18". I think we have that problem solved - thanks to Roger Young (of GTE) again!
GTE finished re-configuring the primary hunt group to help with the intermittent ISDN cause code 18 problem.
Established alternate routing and upgraded the software on the Cisco router. Rebooted the router.
Established alternate routing and upgraded the software on both Ascend routers. Rebooted the routers.
We've added more dial-in lines. We were not running out of lines, but we're growing and want to make sure that we do not run out of lines.
Some of our ISDN customers have notified us that they are experiencing a "Connection Terminated - Cause Code 018" message when they try to connect. As NetIn.com has ample capacity, this has to be a problem with the local exchange carrier (GTE or SWB). I have opened trouble tickets with GTE, but would appreciate you letting me know if you get this error.
The GTE switch upgrade went without a hitch! Way to go GTE!!! (I would have lost that bet!)
GTE recognized and fixed their ISDN call routing problem to Richardson.
GTE has lost call routing information to Richardson. They are working on the problem.
Certain logins were prevented from logging into the system when a router refused to route to the "0" subnet (which is where an authentication server resides). Routing through an alternate router was established, and the troubled accounts are able to login again.
As some users use a different authentication server, we had customers logging into the system and I did not notice any problem. Our customer Terry Sullivan was the first to report the problem. Without this information, who knows when I would have noticed. Our sincere thanks to Terry!
Replaced a possibly intermittently failing DSU, which I suspect was part of the weekend's problems.
We had sporadic routing problems throughout the day. I found and corrected some incorrect routing table entries, and everybody seems to be happy. This should be the end of last night's fiasco (knock on wood).
A customer called and reported that the name server was down, so I tried to reset it. It turned out that the named server was fine, but I discovered that while the main Internet WAN connection was up, the line protocol for that connection was not (this is not something that normally fails, or something I regularly check). As I could reach my router, I assumed it was a problem on my provider's side, and called to report the problem. While they were trying to fix their router, I established alternate routing over a backup line, but while I was doing this, the primary WAN connection was bouncing up and down. This really screwed up my automatic failover routines, so my arp and routing tables were totally screwed up. Great, now I had to hook up consoles to the routers to be able to configure them, because all the devices on the main ethernet segment were so confused that I couldn't telnet... This story goes on, but as I'm tired, I'll shorten it and say that things finally stabilized at 2:30am. Time for bed.
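A bouncing primary link is hard on any automatic failover. One common way to cope is a hold-down timer: fail over to the backup right away, but only fail back once the primary has stayed up for a while. The sketch below is purely illustrative and is not the failover routine actually running here. The class name, the 60 second hold-down, and the timestamps are all hypothetical.

```python
class HoldDownFailover:
    """Pick a route, refusing to fail back while the primary is flapping."""

    def __init__(self, hold_down_seconds):
        self.hold_down = hold_down_seconds
        self.active = "primary"
        self.stable_since = None  # when the primary last came back up

    def observe(self, primary_up, now):
        """Feed one link-state sample; returns the route to use."""
        if not primary_up:
            # Primary is down: use the backup and forget any recovery timer.
            self.active = "backup"
            self.stable_since = None
        elif self.active == "backup":
            if self.stable_since is None:
                # Primary just came back: start timing its stability.
                self.stable_since = now
            elif now - self.stable_since >= self.hold_down:
                # Primary stayed up long enough: fail back.
                self.active = "primary"
                self.stable_since = None
        return self.active


# Hypothetical samples: (primary_up, seconds since start).
link = HoldDownFailover(hold_down_seconds=60)
for up, t in [(True, 0), (False, 10), (True, 20), (False, 30), (True, 40), (True, 100)]:
    print(t, link.observe(up, t))
```

With this approach the routing tables only change twice during a flap (once out, once back) instead of on every bounce.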
The main WAN connection went down. Alternate routing was established, but there were some routing problems on some new segments. The main WAN circuit was "bouncing" up and down for a while, and GTE could not tell why. The connection became stable by around 2:00pm, and I re-established "normal" routing.
Updated the OS on the Ascends, enabling NAT and multicast support. However, I forgot to restore the user profile information before rebooting (Oops), so one machine took about a 10 minute service outage.
The name server lost sync, and would not re-sync. I restarted the name server at about 5:15pm, and all is well.
The main WAN connection went down around 3:35pm. Alternate routing took effect across backup ISDN lines, and a call was made to GTE. GTE fixed the normal WAN connection in approx. 15 minutes. Normal routing was back in place before 4:00pm.
GTE ISDN Guru Roger Young did the impossible by putting our lines properly into our hunt group. Many thanks Roger!
Discovered that GTE had 40% of our "dial in" lines misconfigured such that the lines were not usable.
Launched a new pricing program to reward efficient users with lower prices.
At ~6:15pm, NetIn.com's main internal LAN suffered a total blackout when an ethernet card on a backup server (bagend) went faulty. The problem was fixed, and the LAN was restored by 6:40pm, for a total outage of about a half hour. Several servers, routers, etc, were rebooted in the diagnosis process. It sort of makes you wonder about the necessity of backup machines when a backup machine causes the outage!
A stale and unremovable file lock on a cdrom file caused me to reboot rivendell (it seemed like a good time, as no one was on the system).
Created this page.
Reconfigured the failover software on the backup server.
Added the ability for our customers to have their own cgi-bin directory and programs. These cgi-bin programs run with the owner's permissions (with power comes responsibility...). See the user setup for instructions.
Upgraded the OS and support software on our backup server. For the curious types, the backup server is now running Slackware distribution 3.1.0, kernel version 2.0.27.