The primary purpose of this page is to informally document the changes (hopefully improvements) with NetIn.com. It is a place where our customers (and potential customers) can track what has been done, and where NetIn.com is going. Again, it is very informal, and any comments or suggestions you have are welcome.
Updated mail server software on securemail.
Removed Osirusoft RBL configs from mail servers as they decided to blacklist the world. We understand they have shut down operations.
Updated the firmware and reloaded (rebooted) all our Cisco routers due to the security announcement Cisco made today (i.e., a security problem in previous versions of Cisco's IOS). Restarting the routers caused a small (less than 10 minute) outage for all customers except dialup and ISDN (which were not affected).
There will be misc. routing weirdness while other network service providers do the same updates...
Well, in one of those best laid plans of mice and men stories, our triple redundant, it's never supposed to go out, backup power system failed at our main POP. There were storms in Dallas and lightning hit the Infomart. Power failed over to the battery backup power system, and then to the generators. But when normal power was restored, the generators turned off but power didn't switch back to the commercial power source. When power was restored, some systems booted from the wrong images, or used the wrong configuration profile, etc. So certain things worked, and other things didn't. It was very hard to track (call it a network administrator's typical nightmare). By 5am, things were back to normal.
We did some software updates to our email servers. We did the updates at different times for each of the servers. The updates took about 1 minute per server. The individual servers would be unavailable during the update process, so users might have seen a small outage for mail. No mail was lost, as new mail was queued at backup servers.
A storm rolled through Dallas causing a power outage at our Irving POP for about 2 hrs. Battery backup lasted about an hour. After that, all our phones were out, and Internet service to ISDN customers in the mid-cities and Ft. Worth areas was down. All other Internet services continued to operate normally (as they are serviced through our Dallas POP).
Added new notes section to the securemail system.
Published web pages on mail filtering service.
Deployed new mail filtering service.
Deployed a new NS1 name server.
The SWB Riverside central office is transitioning back to their primary facilities. This caused about a 10 minute outage for T3, T1, Frame-Relay, ATM, and DSL services. I assume they will be switching the rest of the services too, so the ISDN and analog dialup customers should expect a short outage when they do that. I do not know when that will be, but it will probably happen in the next 6 hrs or so...
The SWB Riverside central office suffered a power outage, causing an outage for us and much of downtown Dallas. All T3, T1, Frame-Relay, ATM, DSL, ISDN, and analog dialup services (about everything except hosting services) were affected. At 1:55am, most services were resumed. By 2:30am, everything but 1 customer's T1, and a few ISDN services were running.
An interface reset seems to have caused a cascading routing protocol failure across the NetIn.com backbone. This caused about 10 minutes of intermittent Internet connectivity. The situation was cleared at about 11:00 am.
Reconfigured and restarted the SWB based DSL network statistics (this means all the graphs start new).
Moved SWB based DSL customers to new IMA circuit.
The web server on our secure web based email system was down due to updates that did not get installed correctly. The secure web server appears to have stopped yesterday at around 10:30am. The mail server itself was up and running, and no mail was lost. It was the web based interface to the mail system that failed, and it is now fixed.
A SCSI tape device on the mail server locked the SCSI bus of the Irving mail server. This prevented the mail system from writing to the disks. We reset the SCSI bus to fix the problem.
Tried to apply some updates to a backup router at the Irving POP, but when the router rebooted it messed up the routing tables, causing an outage for some of the Irving ISDN customers.
Updated the firmware on the ISDN terminal servers in Irving and Dallas. This took a 5 minute outage as the machines rebooted.
Our main ISDN and analog dialup equipment at the Dallas facility died at about 9:30am. We do not know the full extent of the problem with the hardware, but the whole box was replaced. Service to most of the ISDN and analog dialup customers was restored by about 6:30pm. We did lose the connection profiles for a couple of customers, but we have contacted them (for password info, etc.). I'd like to take this time to thank Deb of our office for all her hard work and the professional way she handled an incredibly stressful day. Thanks Deb!
Our SWB based DSL customers had spotty service during this time. I entered a debugging command in a router that the router did not like. I had to reboot the router to clear the problem.
I changed the SWB DSL usage graphs section to be (hopefully) easier for you to use and easier for us to maintain. You used to need your interface id, which SWB doesn't give you, and which is hard for us to look up. Now you need your vpi/vci info, which is still technical info, but you should get this info when the service is installed, and it is fairly easy for us to look up. The down side to this is that the data graphs start from the beginning again (i.e., historical data before today is not represented in the graphs).
The news server is back online.
There were problems with the new version of IOS on brandywine, so I rolled back to an earlier version. This required a reboot of the router which caused a ~4 minute outage for SWB based DSL customers.
The news server (process) died because the news history file grew past the largest file size of 2GB. This was due to some bad people flooding the control channel, and I'm still looking into how they did this. I rebuilt the history db, but didn't consider that the overview db was also way oversized, and ran out of disk space. I'm currently rebuilding the history and overview db's.
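A 2GB file size ceiling like the one mentioned above typically comes from a signed 32-bit file offset. As a rough illustration only, here is a minimal Python sketch of a check that could warn before a file gets close to that ceiling. The function names and the 90% threshold are assumptions made up for this example, not something taken from the news server software.

```python
import os

# Largest size addressable with a signed 32-bit file offset (2 GiB minus 1 byte).
MAX_32BIT_FILE_SIZE = 2**31 - 1

def near_size_limit(size_bytes, threshold=0.9):
    """Return True when size_bytes is at or above threshold * the 2GB ceiling.

    The 0.9 default is an arbitrary, assumed warning level.
    """
    return size_bytes >= threshold * MAX_32BIT_FILE_SIZE

def file_near_limit(path, threshold=0.9):
    """Same check for a file on disk; path is any file you want to watch."""
    return near_size_limit(os.path.getsize(path), threshold)
```

Run from a periodic job against the history file, a check like this would flag the problem while there is still time to expire old entries.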
Updated memory and IOS version on brandywine.
A drive on mail.netin.com quit responding, and was reset.
Experienced a denial of service attack.
The history db and overview sections have been rebuilt. News was restarted, and is re-synchronizing.
The news server is down. I know it has some corrupted files, but I'm not sure what else is wrong yet...
Ok - the disks are fine, the cycbuffs are fine, but history and overview are corrupted and need to be rebuilt. This takes a long time...
Updated our border router and NIDS to Level 3. This should be totally transparent to our users.
We were having too many problems with our old news software, so we replaced it with new news server software. The new server was started at about 4:00pm.
Established peering with Internap.
Performed major updates on the news server, including reindexing the news spools.
Made security updates to most of our servers. The Irving mail server required a reboot (meaning it was offline for a couple of minutes while it rebooted).
The new drive on the statistics server failed, and had to be replaced again. While the machine was down, we updated its other hardware and OS. We restored the history data from Sunday's backup - so there is a hole in the data starting on 12/2/2001 around 1pm until 12/4/2001 at about 3am.
A disk drive on our statistics server was failing, giving intermittent errors. We replaced the drive, and restored the history data from a backup.
Installed new anti spam filters on the news server.
Reindexed news server.
We published our new and improved web site!
A Level 3 core router bounced several times, causing a cascade of problems. First, there is a backup for this router, and traffic should have automatically been rerouted through the backup. However, that did not happen. If that path to the Internet is no longer available, the routing tables should have been updated, but since there was supposed to be a backup, the routing tables were not updated. This caused all the NetIn.com traffic that goes through Level 3 to be dropped, instead of going through another path to the Internet. This means that you may or may not have had a problem getting to certain sites during this time, depending on whether traffic to your site is routed through Level 3. The Level 3 router bounced up and down several times, making this whole situation rather difficult to understand and making access to Internet sites intermittent. At ~6:00pm, I manually downed the Level 3 interface so that traffic would be routed around Level 3 (so you could get to all sites again). Level 3 seemed to get the problem fixed around 7:00pm. I re-established the connection and routing to Level 3 at about 7:30pm.
Updated the mail server software.
I've reconfigured some of the statistics gathering software for the new router, so some stats are available again...
Found and fixed a network protocol problem between us and Level 3. I turned off autonegotiation, and forced a configuration on the interface. There were routing problems while I was disconnecting and connecting the interface. As of 12:45am, everything seems to be back to normal.
Enabled new active firewall settings.
Installed and put into production a new intrusion detection system and new router. This also means that our stats for Internet traffic are going to be wrong until we reconfigure the monitors to look at the new router.
One of our news feeds (upstream provider) is having trouble, so I took this time to update our news server software and reindex the news spools...
The main Dallas ISDN terminal server reset itself, and shut down the authentication servers. The servers were restarted and everything is working again. Thanks to Joe Magee for calling our attention to it.
Our mail servers are rejecting mail from servers listed on the ORBS database. We suspect that these servers were listed due to contracting the nimda virus, and the virus trying to spread itself via email - but this is just a guess. For now, we have stopped using the ORBS service.
Nimda worm is causing havoc with the Internet.
Reverted to older news server software and reindexed news server spools again.
Reindexing the news server spools.
Updated the news server software.
Updated the mail server software.
Rolled the Irving MAX back to a previous firmware version, due to some compatibility problems with Netopia ISDN routers. This caused a small (5 minute) outage for Irving ISDN and analog dialup customers.
Our T3 containing T1s to most of our customers (and ourselves) went down for 6 minutes. While we were still connected to the Internet, we were not connected to most of our customers. This problem affected most of our ISDN customers, and virtually all T1, Frame Relay, ATM, xDSL, and dialup customers.
The Irving POP rolled over to their backup lines when the primary line started experiencing errors. We took the primary line down for testing and took the time to update some equipment and software. Service to the Irving POP was interrupted when transitioning from the primary line to the backup (it takes a few minutes for new routes to be calculated and stabilize). But other than that short period, everything operated normally. The primary line problem was traced to a cable that went bad. We rolled back to the primary line at 2:15am.
The supernews news feed is restored. Supernews had a human error, and disconnected the wrong customer. We're unhappy about that, but mistakes happen. Thanks to Ross Ashley for bringing the problem to our attention!
One of our news feeds (the one from supernews) is down. We are looking into the problem.
Put into beta test a new caching web proxy with anonymizing features.
Put two new mail servers into production. These new mail servers have more spam prevention tools.
SWB took the T1 to Irving down for test, without telling us. Routing should have automatically rolled over to our secondary line, but didn't. I'm looking into this. Irving was down for about 15 minutes.
We put our new news server into production.
Some new security patches broke our secure mail server's ability to send email. This is now fixed.
We have restored most of our network stats. There is still more work to do, but many stats are there.
Updated our core router's firmware. This caused about a 2 minute outage while the router rebooted.
Tried to update our core router's firmware. This did not go so smoothly, and caused intermittent interruptions between 11:30pm and 2:00am. We reverted to the previous firmware.
Installed new interfaces on our core router. These are "hot swap" interfaces, so they caused no service interruption.
There was trouble with the Irving ISDN and analog dialup routers causing problems for some customers. The trouble was corrected and the router reset at about 1pm.
We are "swamped" with calls from people who did not change their DNS settings.
We have turned off Internet traffic graphing while we make more updates. This service will be reconfigured and returned, but is a lower priority item...
Did the cutover to new services. Only new addresses are routed. This means that anyone still using the old addresses cannot use the Internet. The transition did not go as smoothly as planned, which is probably obvious by the amount of time it took.
Emailed notifications to customers saying that the change over is rescheduled for 12/30/2000.
We were supposed to do a major cutover to new services from new vendors, but not all vendors were ready. The cutover was postponed. Emailed a notification to customers saying that this would have to be rescheduled.
Our invoice included a notification about how we are scheduling an outage for 12/28/2000 to install new services.
Emailed information to all customers about changes to our services (we're adding bandwidth, phasing out old IP addresses, moving equipment, adding new equipment, etc.) and instructions on how to change to our new DNS servers.
Switched our news server from Verio to Supernews. This may require news users to reset their news reader message caches.
The Verio co-lo room lost air conditioning at about 3:30pm. The more sensitive of our machines started automatically shutting down at around 8pm (fast hard drives are very sensitive to heat). We got things running again around 11:30. Verio continues to be unable to explain why this happened. The AC went off again on Thursday around midnight, but this was fixed in about an hour.
We experienced about a 1 hr outage (slightly less) when our main router shut down all interfaces. We are researching why this happened.
We believe the problem was caused when Verio overloaded a backup power supply. The situation has been fixed.
One of our mail servers and authentication servers failed when adjusting from CDT to CST. It was reset by about 11 am. No mail was lost as all mail destined to this server was queued at a backup server. However, some dialin customers could not connect until the server was reset.
One of the lines used for GTE based DSL went down from around 3:30pm to about 4:30pm. This did not affect all our GTE based DSL customers, just the ones using that particular line. GTE fixed the problem and had things back up at about 4:30pm.
Put new web based mail server into beta test.
The full story: ITU tried to replace a transformer last night, for our Irving POP, but I guess something wasn't quite right. The Irving POP went on battery power until about 4:15am. I didn't discover this until about 5:00am. I called ITU, and went to switch over to the generator. Of course the generator would not start... Anyway, I got the generator running at about 8:00am, and got most Internet related things running by about 8:30am. All the normal office equipment (including our phone switch) was still without power, so we still couldn't take voice calls (sorry - but I did get the Internet equipment powered! ;^) I didn't discover until later that our primary name server process didn't start. I got it running at about 10:00am. Normal power was restored at about 12:00pm. We rolled back over to normal power around 12:30pm. The normal office voice lines are working again...
To recap, the Irving POP took an Internet outage from about 4:15am until about 8:15am. This affected the Irving POP users only. Boy, those two sentences sure trivialize all the work that was done. ;^) Sorry about the problem.
Power went out at the Irving POP. The battery backups lasted about an hour, but after that, the systems shut down. We went to generator power at about 8:00am, and got the Internet equipment back online by about 8:30am. However, the NetIn.com main voice lines (for talking to us) - ok, actually, it's our phone switch, the phone lines are working, the phone switch is not - are still unpowered. Therefore, you cannot call us. More as it develops...
Updated software on our Irving DNS and mail servers. This caused a small (about an hour) outage for Irving POP mail users, while the updates were being installed, but no mail was lost.
SWB stopped routing calls to our Dallas POP's main ISDN and analog modem servers. I'll post their explanation as soon as I get it...
The explanation is that SWB didn't show this phone number as being assigned to us, so they dropped it from their switch (ie: removed it from the translation tables). Thanks to Dean Ostera for fixing this mess.
Someone appears to have unplugged one of our Dallas access servers. (Yea, I know, but it happens.) When they plugged it back in, things did not restart correctly. Anyway, we fixed it and it was back up by 8:00pm.
DNS reverse name lookups are not being routed to us.
SWB stopped routing calls to our Dallas POP's main ISDN and analog modem servers. SWB explained that a "translation feature dropped out of the switch".
12:26am - 12:33am (7 minutes) GTE reset some of their xDSL circuits, resulting in a momentary outage.
1:04am - 1:08am (4 minutes) They did it again.
10:52am - 10:54am (2 minutes) And again. They are getting faster. ;)
11:06am - 11:09am (3 minutes) And again.
The Dallas POP mail server network interface was mistakenly shut down. It was restarted at 6:30pm. No mail was lost, as the mail for the Dallas mail server was automatically routed to our backup mail server.
Deployed new web page analyzer software.
We took a 2 minute outage to some of the Dallas based services while I replaced an interface card on our main router.
And we enter the millennium running strong!
Well, we've made it through 6:00 PM CST, which is midnight GMT (our equipment is sync'ed to GMT with an offset), and no problems. On to 12:00 CST!
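For anyone who wants to check the time arithmetic in the entry above, here is a minimal Python sketch using a fixed six-hour offset for CST. The calendar date (December 31, 1999) is an assumption for the example, since the entry itself only gives the time of day, and `cst_to_gmt` is just an illustrative helper name.

```python
from datetime import datetime, timedelta, timezone

# CST is a fixed 6 hours behind GMT/UTC (no daylight saving time in winter).
CST = timezone(timedelta(hours=-6), "CST")

def cst_to_gmt(local):
    """Treat a naive datetime as CST wall-clock time and convert it to GMT."""
    return local.replace(tzinfo=CST).astimezone(timezone.utc)

# Assumed date for illustration: New Year's Eve 1999, 6:00 PM CST.
rollover = cst_to_gmt(datetime(1999, 12, 31, 18, 0))
print(rollover.isoformat())  # 2000-01-01T00:00:00+00:00
```

Equipment that keeps its clock in GMT and applies an offset for display crosses the date boundary internally six hours before local midnight, which is why 6:00 PM CST was the first checkpoint.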
The Dallas POP main web server process died when doing month end processing. Thanks to Christian Farmer for letting us know. The month end processing problem has also been fixed.
Updated software on the mail server at the Irving POP.
While trying to establish a second connection to the internet, certain routes to NetIn.com were lost/suppressed. These were fixed, and now we have redundant connections to different providers (Verio and UUNet). It is ironic that adding a second connection to improve reliability caused a small outage...
Well, for those that monitor our stats, I guess it's plain to see that someone flooded one of our customers. While I'm not happy about this, it did provide a nice stress test of the system.
I applied the latest updates to the main web server and rebooted. All services are up and running.
Rebooted several of the production servers at the Dallas POP. One had a problem with the database server, the others were given kernel updates.
While I was rebooting and changing things, I moved back to the primary Internet interface (so our usage graphs should be tracking traffic again).
Customers on the 204.251.27.0, and 205.241.139-141.0 networks had routing disrupted (connections were up, but traffic was not flowing correctly). This was caused when the Verio routers stopped advertising Sprint derived class C addresses. Verio fixed the problem at 12:20pm.
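If you are not sure whether your address falls inside the networks listed above, the following Python sketch shows one way to check. It assumes each class C is a /24 and reads "205.241.139-141.0" as the three networks 139, 140, and 141; `is_affected` is a hypothetical helper name used only for this example.

```python
import ipaddress

# The class C (/24) networks named in the entry above.
# "205.241.139-141.0" is read here as three consecutive /24 networks.
AFFECTED_NETWORKS = [
    ipaddress.ip_network("204.251.27.0/24"),
    ipaddress.ip_network("205.241.139.0/24"),
    ipaddress.ip_network("205.241.140.0/24"),
    ipaddress.ip_network("205.241.141.0/24"),
]

def is_affected(address):
    """Return True if the dotted-quad address lies in one of the listed networks."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in AFFECTED_NETWORKS)
```

For example, `is_affected("204.251.27.10")` is true, while an address in 205.241.142.0 falls outside the listed range.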
One of the Dallas radius authentication servers did not restart when power was restored. It was not restarted until 4:25, so some of the Dallas dialin customers were not able to connect between ~3:00pm and 4:30pm.
We had an approx. 30 minute outage of all NetIn.com services when a technician bumped a main power switch, removing power to 2 racks of equipment.
I moved our main connection to Verio to another port, so the usage graphs are going to be off - or rather the graphs are correct, I'm just not graphing the new port...
We took about a half hour outage, as I misconfigured 2 interfaces from a remote location, cutting off access to the NetIn.com backbone. Yes, I'm supposed to know better, but when the big guys screw up, they screw up big! So, I had to make a quick trip to the Infomart, fix the misconfiguration, and do what I was initially trying to do. Anyway, things are working again. My bad, sorry.
SWB reported that someone had pulled the power fuses on their MUX equipment in the Infomart (a MUX lets you combine smaller lines into bigger lines, and separate bigger lines into smaller lines). Anyway, this is a rather high capacity device, so this took down much of the communications around the Infomart. All of our T1 connections, most of our ISDN PRI's, and all of our xDSL connections were affected. We reestablished connections to the Irving POP over backup ISDN lines, but service was rather slow as the backup lines were overloaded. The fuses were reinserted, and telco power and connectivity were restored at about 2:20pm. There is no word yet as to who removed the fuses.
At approx. 1:00am, we installed a new Cisco router causing an outage in Internet connectivity. We got some routing reestablished in about 45 minutes, but did not get complete routing reestablished until about 3:30am. This router is to facilitate our multihoming connection to UUNet.
For those of you who actually watch this stuff, some of the traffic statistics will be off until we get the software configured for the new router.
Moved all virtually hosted web site logs into their home directory so that customers can use their own custom software to analyze the logs.
GTE seems to be mucking about with xDSL circuits around the metroplex (this may be associated with the large service outage they had last week). This seems to create about a 30 second or so dropout of service. I don't mean to finger point, but this is not a NetIn.com problem. You need to call GTE (at 1-888-391-1234) to report the problem. NetIn.com HAS called GTE to report the problem, but it would be better if the problem report came from the customer.
GTE reports that they are having trouble with their ISDN and xDSL service for all of Texas. Most of our customers in our SWB and GTE area pop sites seem to be working correctly, with the exception of some intermittent weirdness... So far, they have no estimate when they will have the problem fixed.
We lost power to the main core router. This affected all of NetIn.com. You could reach the NetIn.com servers, but could not go beyond our network. The problem was fixed and the router restarted at 1:30pm.
Verio is having trouble at one of their major routers which is preventing traffic from reaching NetIn.com (and others). The NetIn.com network is working. Verio is working the problem, and I'm waiting on status from them.
11:30am - most of the routing is restored. There is still some trouble with the circuits through Irving.
11:45am - all routing is restored.
We don't know whether a denial of service attack launched against one of our customers, or a process on a foreign network that went astray, caused the congestion on the Irving access server. We identified the source of the problem and filtered it from the NetIn network. During the attack, the performance of the Irving POP was severely diminished. All users took about a 20 minute outage while the filter was installed on our main router.
Restored search engine on web server.
Restored primary radius server.
The backup power supply (BPS) at the Dallas POP died, taking down most of the equipment there. While waiting for the new BPS, I upgraded the software on most of our servers. This, of course, broke a bunch of things...
The new BPS was installed, restoring connectivity to Irving, xDSL users, and some ISDN users.
We rolled Dallas over to the backup radius server and restored service to the rest of the ISDN customers.
Restored web server.
Restored secure web server.
Restored commerce server.
Started using new billing system.
Routing for the 204.251.27.0 network was reestablished. All is well again...
Routing for all but the 204.251.27.0 network (which is all analog modem dial in, most of Irving ISDN, and some of Dallas ISDN) is now routing correctly. The 204.251.27.0 network routes are not yet being advertised correctly...
Verio tried to reboot their backbone routers (I don't know why) and are having trouble reestablishing routing... I'll post more as I know it.
The Dallas ISDN authentication server and the main web server crashed, maybe due to some weird power glitch during the storms. This affected the Dallas ISDN dial in users and virtually hosted web sites (including our own).
Installed new core router.
The phone and power service at the Irving POP was disrupted for about a half hour. While the Irving POP has backup power, the phone lines didn't work. I believe this was due to the storm, but I'm not sure (somebody may have hit a phone/electricity pole?). Both were restored at basically the same time. This only affected the Irving POP.
Verio is having trouble at one of their major routers which is preventing traffic from reaching NetIn.com (and others). The NetIn.com network is working. Verio is working the problem, and I'm waiting on status from them.
At 9:30, Verio fixed their router. I have not yet heard what the problem was.
While making a backup, the mail server crashed at 8:08pm. It was fixed and put into production again by 9:40pm.
Our upstream is having trouble routing some of our addresses. I don't know much more about this yet...
This turned out to be part of a larger routing problem upstream of NetIn. It was cleared by about 5:00pm.
Some GTE ISDN customers are having trouble dialing to our Dallas POP. We (NetIn/GTE/SWB) believe it is a call routing problem, and they are working on it.
Reverted the Irving ISDN access servers back to the previous release level, as analog users were having trouble with this revision of code.
Loaded new firmware on the Dallas and Irving ISDN access servers. In both locations, ISDN users took a couple minute outage while the servers rebooted.
Routing to and from NetIn.com has died in the Verio network. Verio has a router problem and they are working the problem... Routing was restored at 9:30am. Verio reports that an ethernet interface went down, and was repaired.
Installed and configured equipment to handle xDSL service. We can now support xDSL services for GTE customers.
One of our DNS name servers had trouble and was causing performance problems for customers using the Dallas POP. It was fixed at about 10am.
Found and replaced some bad wiring on a PRI at the Dallas POP. This was causing some unexpected disconnects and some other strange behavior.
Performed major hardware and software updates to most of the server machines.
Rebooted our access servers as a result of the Verio troubles (buffer overflows of queued packets to networks that were sometimes available, and to stabilize dynamically built routing tables). Everything is now running normally.
Verio stopped advertising routes to the same 2 class C addresses. Verio restored the routes at 6:50pm.
Verio stopped advertising routes to 2 of our class C addresses. Verio restored the routes at 5:00pm.
In some areas around town, 10 digit dialing (using the area code even when calling a number in the same area code) took effect. If you're having trouble dialing us, you might add our area code to the number you dial. Our Dallas POP is in the 214 area code, and the Irving POP is in the 972 area code.
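For those who script their dialers, here is a minimal Python sketch of the 10 digit dialing rule described above: if a number has only 7 digits, put the area code in front of it. The function name and the 555-01xx sample numbers are made up for illustration and are not our access numbers.

```python
import re

def to_ten_digits(number, area_code):
    """Return a 10 digit dial string, prepending area_code to 7 digit numbers."""
    digits = re.sub(r"\D", "", number)  # drop dashes, spaces, and parentheses
    if len(digits) == 7:
        return area_code + digits
    if len(digits) == 10:
        return digits
    raise ValueError("expected a 7 or 10 digit number: %r" % number)

# Hypothetical sample numbers, using the Dallas (214) and Irving (972) area codes.
print(to_ten_digits("555-0100", "214"))        # 2145550100
print(to_ten_digits("(972) 555-0101", "214"))  # 9725550101
```

A number that already carries its area code is passed through unchanged, so the same helper works whether or not a stored number has been updated.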
Routing to the missing class C has been restored.
Verio dropped one of the NetIn.com class C addresses. They have been informed of the problem. This class C also has the NetIn.com primary and secondary name servers. If you're having DNS trouble, you can use our 205.241.139.4 name server.
The router is working again. Our Verio upstream provider had trouble with some interface cards on a core router.
The next router upstream of NetIn.com is having trouble.
Replaced first router upstream of NetIn.com.
Loaded new firmware on the Dallas Access Servers. Rebooted servers.
Loaded new firmware on the Irving Access Servers. Rebooted servers.
Well, I promised to write about the good and the bad, so here is some more bad (today was not the best of days...). First, the continuing saga of Verio's integration of Onramp and NKN led to a small service outage due to a routing issue. This was about a 5 minute outage. Then there was a larger outage starting at 2:44pm and lasting until 3:40pm. The problem seemed to be caused by some ISDN signaling issues.
The Dallas radius server crashed while performing some month end processing. It was restarted by about 9:00am.
Rebuilt the kernel on the primary mail server to optimize its performance. Rebooted the primary mail server at 12:30pm.
We have been having some very strange and intermittent trouble with our primary mail server. It appears that the CPU fan died, and the CPU overheated. This also caused some file corruption, so at around 11pm, I took the machine down, replaced the fan, reformatted the drives, and installed the latest OS... This is really a bummer, as this machine had been running for 388 days before this happened!
Our server that collects performance data ran out of disk space and wigged out. As this machine is not critical to operations, it was not fixed until about 10:30pm. This is the reason for the flat spot in the performance graphs for today.
Well, the latest and greatest software on the Irving Access Servers turns out to have problems with Farallon routers (any others?). I've moved down firmware releases to the same version as the Dallas POP. This caused a couple of minute outage to Irving, but the Farallons work again. Thanks go to George Vagner for letting us know of the problem. Sorry about all the reboots. What a wasted effort.
The update didn't go as smoothly as I had hoped. The Irving POP had a small glitch but otherwise went well. Irving Access Servers now have the latest and greatest firmware.
The Dallas POP is another story. The updates did not install as advertised and had incompatibilities with some existing hardware. I stepped down the releases, without success until I was back to where I started. This caused an outage of about an hour (for reboots with each release I tried) instead of the planned 5 minutes.
For those who like to have notices, I will be updating the firmware for our ISDN and analog access servers this weekend at both the Dallas and Irving POP sites. This means there will be a small (5 minute) outage for ISDN and analog users only, when the machines reboot. Hopefully this will be done late at night and no one will even notice. ;^)
Performed software updates to the main and some secondary web servers.
Put the secure web server into production.
Performed software updates to the database servers.
Installed and configured new secure web server.
Rebooted router2 (Irving POP) to make some changes take effect.
New PRI is installed, tested, and put into production.
The primary web server was accidentally disconnected from the LAN. It was restored at 3:40pm.
One radius server process died. This prevented some dial in accesses until it was restarted.
A T1 span used in the Dallas POP for dial in and some dedicated users went down. This only affects the users of that T1, all other services are working. SWB said there was a jumper problem in a cross connect.
A T1 span used in the Dallas POP for dial in is down. We don't yet know the reason. SWB has been dispatched. Irving access is not affected, some of the Dallas access is not affected, web servers and Internet connectivity are not affected. I'll post more as I know it...
Built and installed a new kernel on the main web server. Running the new kernel required a reboot...
A spam house found a way to have NetIn relay for them, and clogged our mail and DNS system for about an hour before I could get them shut down. Performance of NetIn overall suffered during this time. This particular breach has now been closed.
The main web server crashed and was rebooted. I don't yet know why this happened...
The Dallas POP users' usage report programs are now configured. They really need a little more testing, but I'll post what I have. Please let me know if you have any trouble with them.
Electrical work is being done at the Dallas POP, and the backup power was taken out of the circuit. We took small reboots as circuit breakers were mistakenly flipped. A small BPS was put back into the circuit until the wiring, including the large BPS, is done.
Starting to move traffic through new equipment (latest and greatest Ascend and Cisco wiz bang stuff. ;^) Please forgive our little glitches as this transition takes place.
In the next few days we will also be moving misc. services (like the web and DNS servers) to new equipment too. Expect some of the web services (like usage reports, and traffic statistics) to be unavailable for a short time, until these services can be reconfigured. Thanks
We have had some strong storms that have downed some lines in the area. GTE is trying to fix these lines, but the work is causing trouble on our dialup lines. They hope to be finished by 2:00pm. If you get disconnected, please just reconnect, as they sometimes just touch the wrong wires...
GTE has restored all lines. Testing is finished. I would like to extend a personal thanks to the GTE field team of William Ford and ??? for their hard work on a Saturday and after hours - thanks guys.
GTE has about 25% of our incoming lines down for testing - but there are plenty of lines available. If your connection drops unexpectedly, then they probably connected to the wrong lines. Please feel free to reconnect.
GTE is reporting trouble in the Plano office. They are supposed to be adding capacity to the 2 Plano POPs, and hope to be done by today or tomorrow. As of 6:30pm, they have no status, so I would expect this is a tomorrow thing.
Yesterday's GTE trouble in Irving was due to routing problems where GTE meets other carriers. GTE reports that this is now fixed.
GTE is reporting metroplex (state?) wide trouble with ISDN. If you are having weird or intermittent trouble connecting to NetIn via ISDN, you might want to check that your ISDN TA is connected to the phone company's switch (check to see if the D channel is working). If it isn't, then you should probably call the phone company. GTE's ISDN national support number is 1-800-555-6635. You might want to let us know too.
Some type of surge on the main Internet connection took down that connection, the DSU on that line, and our main web server. The backups instantly picked up the load, so there was no down time. The DSU cycled itself and the main Internet circuit reestablished itself. I manually rebooted the main web server, and switched back to all primary equipment by 2:05pm.
The main name server process died. The secondary name server was working as normal. The primary name server was restarted at about 8:15pm. The primary and secondary name servers are scheduled for software updates later this week or early next week.
The CSU on the primary lines seems to be going bad. I replaced it with a backup unit around 5:00 and tested the lines. Things look good again, so I switched back to the primary Internet circuit. Again, there was no downtime for users.
I diverted our Internet traffic to the backup lines to do testing on the primary lines. There was no downtime for customers.
The main name server process crashed. The secondary name server was online, so we did not notice this for a while (1:45pm). Those that did not have a secondary name server configured may have noticed a problem. Anyway, as of 1:50pm, the primary was restarted, and all is working well again...
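For anyone wondering why a configured secondary hides this kind of crash, the lookup logic amounts to something like the sketch below. It is an illustration only: `resolve` and the caller-supplied `query` function are hypothetical names, not part of any real resolver library.

```python
def resolve(hostname, name_servers, query):
    """Ask each name server in order and return the first answer received.

    `query(server, hostname)` is a caller-supplied function (hypothetical)
    that returns an address string, or raises OSError when the server
    does not respond.
    """
    last_error = None
    for server in name_servers:
        try:
            return query(server, hostname)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next server
    raise OSError(f"no name server answered for {hostname}") from last_error
```

With only the primary in the list, the first failure is also the last one, which is what customers without a secondary would have seen.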
At about 7pm NetIn.com lost connectivity (including the backup lines) to our upstream provider. While the NetIn.com network was fine, customers could not get to the Internet. The upstream provider's backup power supply malfunctioned (and burned out!), putting the upstream provider in a blackout. They moved the equipment off of the faulty UPS, and onto "normal" power, restoring Internet connectivity at 8:10pm. This equipment will take another small outage when power is again routed through the repaired UPS.
We had more problems with the Dallas POP router4 (it stopped routing for about 5 mins), so we turned off OSPF routing on the unit. The router has to be rebooted for this to take effect. This turned into about a 20 min outage while the routes were manually restored. Again, this only affected the Dallas POP.
Moved router4 back to a previous revision of the OS - had to reboot the router.
Don't know how I did it, but I killed the web server process on our main web server. I didn't notice this for about a half hour.
The Dallas POP experienced a half hour outage due to changes required for the migration to the Infomart.
Some ethically challenged people took advantage of NetIn's anonymous ftp facility, which degraded performance slightly for a while. The main thrust of their vandalism happened from 8:00pm until 8:45pm (which is when I stopped it). Due to this incident, the anonymous ftp storage facility is no longer available. Please let me know if this inconveniences you.
Upgraded the software on router4.
Sync'ed all NetIn clocks to stratum 3 clocks.
MAE West replaced some equipment on the 3rd and 4th. Things seem to be stable now.
MAE West is having trouble, which is causing saturation of the other MAEs and route flapping. In other words, the performance of the Internet is going to be bad until MAE West and routes stabilize.
Moved WAN at the Irving POP back to the T-1 span due to instabilities in the ISDN routing equipment. The manufacturer was called and is involved in solving the problem.
Basically, the 10/28/97 problem happened again: the backup lines were established, but did not route. This caused an overnight outage at the Irving POP, which affects only dialup customers.
Disaster struck. Before I got the backup lines in place (oh - the new 10Mbps connection means new equipment, new lines, new configurations, and of course new problems!) the new router went down. So far, I'm not sure of the exact cause of this, but I thought I'd post something so you would know what happened.
I've been real busy lately (and grumpy!) configuring our 10Mbps connection to the Internet, and a second POP site (one in GTE land, one in SWB). The systems seem to be pretty stable now (had a little routing problem earlier in the day). There will be some account changes, but I will let you know if you're affected. And, of course, all the monitoring software now needs to be reconfigured, along with web pages that need to be updated. A man's work is never done...
I replaced the old "get-router-stats" program with the mrtg program. All the history was lost, but I like the new format a lot better. In our typical open kimono style, you can view our stats.
Added an additional 64MB of RAM to fangorn - so fangorn was rebooted. This was needed due to the size of the data files for usage reports.
The primary Internet connection went down for about 5 minutes. I established the secondary connection, but the primary was restored a couple of minutes later. We suffered about a minute of actual outage while detecting the problem and rolling over to the backup.
Noticed and fixed a bug in the usage report where requesting information for September gave data for August. This is now corrected.
Moved the new node, fangorn, from test into production. Fangorn is a dual Pentium Pro system sitting on a 100BaseTX segment, and uses 10,000rpm Ultra SCSI fast and wide Cheetah drives.
Verio was performing some upgrades and things went bad. We lost connections to all the mae's at one time or another, but problems seemed to center around mae-east. Things were stable again by 9:45pm.
Discovered that bagend's clock was drifting. This is important as bagend tracks things like WAN usage. Anyway, I fixed it - but the WAN usage reports probably are not accurate for days before 8/6/97 (they show usage being too high, as the clock was running slowly).
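The arithmetic behind the "too high" readings can be sketched as follows, assuming the reports divide byte counts by elapsed time taken from the local clock (an assumption about how such reports work, not a description of bagend's actual code). The numbers are made up for illustration.

```python
def reported_rate(bytes_transferred, real_seconds, clock_speed):
    """Rate as computed by a host whose clock runs at `clock_speed` times real time.

    clock_speed = 1.0 is a perfect clock; 0.8 means the clock advances
    only 0.8 seconds for every real second (it runs slow).
    """
    measured_seconds = real_seconds * clock_speed  # what the drifting clock thinks elapsed
    return bytes_transferred / measured_seconds


# Same traffic over the same real interval, measured by a good and a slow clock.
true_rate = reported_rate(1_000_000, 100.0, 1.0)
slow_rate = reported_rate(1_000_000, 100.0, 0.8)
```

Because the slow clock sees less time pass, the same number of bytes is divided by a smaller interval, so `slow_rate` comes out higher than `true_rate` (by 25% with these example values).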
Moved ethernet based equipment onto our new switched 100Mbps backbone and to our new 10Mbps switch. Most likely, you will not notice a huge difference, as the previous LAN equipment was kick'n butt. This just makes sure we have the infrastructure in place to support growth.
Reestablished primary WAN connections. There was a combination of problems, starting at GTE's Cascade switch, then a DCX, and finally my DSU. The problem ticket is closed, but I'm going to watch this circuit for a while.
Lost protocol on primary WAN circuit. Rolled over to secondary circuit.
Moved equipment around. Two routers were rebooted (not at the same time! ;^) when moving them to new UPS's. There was a small (less than 15 min at around 2am) outage while I rerouted some lines.
A beta version of a web based application to check your usage is now available. This report is password protected, and is up to date as of the last time you logged out (ie: it does not contain information about your current session). I've also created a page so that you can change your password. Please let me know if you have any problems with these programs.
I've also put up a page graphing our main WAN connection usage - for those who are interested in such things. It's updated several times a day, so remember to press reload...
Well - it was one of those days. It started with MAE-West's Giga-Switch going down (that means losing a T3 for us) for several hours. We still have 2 other T3's going to different NAP's, which should have handled the load, but my provider took the opportunity to upgrade/reconfigure some ports, and routing got screwed up. I still don't have the full story... But MAE-West is back and so is normal routing/operations.
It's been a week, and nobody has complained of a "cause code 18". I think we have that problem solved - thanks to Roger Young (of GTE) again!
GTE finished reconfiguring the primary hunt group to help with the intermittent ISDN cause code 18 problem.
Established alternate routing and upgraded the software on the Cisco router. Rebooted the router.
Established alternate routing and upgraded the software on both Ascend routers. Rebooted the routers.
We've added more dial-in lines. We were not running out of lines, but we're growing and want to make sure that we do not run out of lines.
Some of our ISDN customers have notified us that they are experiencing a "Connection Terminated - Cause Code 018" message when they try to connect. As NetIn.com has ample capacity, this has to be a problem with the local exchange carrier (GTE or SWB). I have opened trouble tickets with GTE, but would appreciate you letting me know if you get this error.
The GTE switch upgrade went without a hitch! Way to go GTE!!! (I would have lost that bet!)
GTE recognized and fixed their ISDN call routing problem to Richardson.
GTE has lost call routing information to Richardson. They are working on the problem.
Certain logins were prevented from logging into the system when a router refused to route to the "0" subnet (which is where an authentication server resides). Routing through an alternate router was established, and the troubled accounts are able to login again.
As some users use a different authentication server, we had customers logging into the system and I did not notice any problem. Our customer Terry Sullivan was the first to report the problem. Without this information, who knows when I would have noticed. Our sincere thanks to Terry!
Replaced a possibly intermittently failing DSU, which I suspect was part of the weekend's problems.
We had sporadic routing problems throughout the day. I found and corrected some incorrect routing table entries, and everybody seems to be happy. This should be the end of last night's fiasco (knock on wood).
A customer called and reported that the name server was down, so I tried to reset it. It turned out that the named server was fine, but I discovered that while the main Internet WAN connection was up, the line protocol for that connection was not (this is not something that normally fails, or something I regularly check). As I could reach my router, I assumed it was a problem on my provider's side, and called to report the problem. While they were trying to fix their router, I established alternate routing over a backup line, but while doing this, the primary WAN connection was bouncing up and down. This really screwed up my automatic failover routines, so my arp and routing tables were totally screwed up. Great, now I had to hook up consoles to the routers to be able to configure them, because all the devices on the main ethernet segment were so confused that I couldn't telnet... This story goes on, but as I'm tired, I'll shorten it and say that things finally stabilized at 2:30am. Time for bed.
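Two lessons from that night can be sketched in code: treat the primary as healthy only when both the link and the line protocol are up, and use a hold-down timer so a bouncing circuit does not drag the routing back and forth. The class below is a hypothetical illustration of that general technique, with assumed names and an assumed 60 second hold-down; it is not NetIn's actual failover routine.

```python
class FailoverSelector:
    """Choose between a primary and a backup path, with a hold-down timer."""

    def __init__(self, hold_down_seconds=60.0):
        self.hold_down_seconds = hold_down_seconds  # assumed value, tune to taste
        self.active = "primary"
        self._healthy_since = None  # time the primary last became healthy

    def update(self, now, link_up, protocol_up):
        """Record one health sample taken at time `now`; return the active path."""
        healthy = link_up and protocol_up  # link up alone is not enough
        if not healthy:
            # Any failure resets the timer and moves traffic to the backup.
            self._healthy_since = None
            self.active = "backup"
        else:
            if self._healthy_since is None:
                self._healthy_since = now
            # Only return to the primary after it has stayed healthy long enough.
            if now - self._healthy_since >= self.hold_down_seconds:
                self.active = "primary"
        return self.active
```

While the primary flaps, every bad sample resets the timer, so traffic stays on the backup until the circuit has been continuously healthy for the full hold-down period.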
The main WAN connection went down. Alternate routing was established, but there were some routing problems on some new segments. The main WAN circuit was "bouncing" up and down for a while, and GTE could not tell why. The connection became stable by around 2:00pm, and I reestablished "normal" routing.
Updated the OS on the Ascends, enabling NAT and multicast support. However, I forgot to restore the user profile information before rebooting (Oops), so one machine took about a 10 minute service outage.
The name server lost sync, and would not re-sync. I restarted the name server at about 5:15pm, and all is well.
The main WAN connection went down around 3:35pm. Alternate routing took effect across backup ISDN lines, and a call was made to GTE. GTE fixed the normal WAN connection in approx. 15 minutes. Normal routing was back in place before 4:00pm.
GTE ISDN Guru Roger Young did the impossible by putting our lines properly into our hunt group. Many thanks Roger!
Discovered that GTE had 40% of our "dial in" lines misconfigured such that the lines were not usable.
Launched a new pricing program to reward efficient users with lower prices.
At ~6:15pm, NetIn.com's main internal LAN suffered a total blackout when an ethernet card on a backup server (bagend) went faulty. The problem was fixed, and the LAN was restored by 6:40pm, for a total outage of about a half hour. Several servers, routers, etc, were rebooted in the diagnosis process. It sort of makes you wonder about the necessity of backup machines when a backup machine causes the outage!
A stale and unremovable file lock on a cdrom file caused me to reboot rivendell (it seemed like a good time, as no one was on the system).
Created this page.
Reconfigured the failover software on the backup server.
Added the ability for our customers to have their own cgi-bin directory and programs. These cgi-bin programs run with the owner's permissions (with power comes responsibility...). See the user setup for instructions.
Upgraded the OS and support software on our backup server. For the curious types, the backup server is now running Slackware distribution 3.1.0, kernel version 2.0.27.