Suspend and resume, revisited…

September 22nd, 2007

On my HP 6110 laptop, it takes about 5 seconds to suspend, another 10 to resume, and 5 more to get an IP address. 140,000 suspend/resume cycles would take about a month. Sure a good thing we aren’t trying this on a regular laptop….
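A quick back-of-the-envelope check of that "about a month" figure, using only the timings quoted above. The constant names are just labels for this sketch:

```python
# Timings quoted above for the HP laptop, in seconds.
SUSPEND_S = 5
RESUME_S = 10
DHCP_S = 5
CYCLES = 140_000

seconds_per_cycle = SUSPEND_S + RESUME_S + DHCP_S
total_days = CYCLES * seconds_per_cycle / 86_400  # 86,400 seconds per day

print(f"{seconds_per_cycle} s per cycle, {total_days:.1f} days in total")
```

That works out to 20 seconds per cycle and a bit over 32 days of nothing but suspending and resuming.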

You know, suspend/resume is *really* hard….

September 22nd, 2007

Like eating a jumbo bag of potato chips from Costco, making suspend/resume *really* work is *really* hard.

First you get the hardware to appear to work properly…. That’s a big bowl right there…

Then you have to go get a refill, on Linux drivers…. The next big bowl… But they mostly just work.

Then you get to fix some pretty obvious hardware problems…. Such as minor pops from the audio amplifier when power is reapplied. A few more chips….

Then you get to do some engineering, such as Marcelo’s work on lazy resume on USB, Dan and Marcelo’s work on the wireless driver and its interactions with the wireless firmware, Javier’s and Marvell’s work on the wireless firmware, Chris B’s endless testing, back to the audio driver, Andres and Jaya’s work on the audio driver, and so on… Another big bowl…

After all that, you start seeing Linux suspend and resume hundreds or thousands of times. At this point, everyone would consider this “good enough” for a regular laptop, and move on. We’ve learned better.

For us, it’s not good enough… We’re going to be suspending between every page rendered on the screen when reading books or the web, so thousands of times/day is entirely expected… And someday, we’d like to even be able to suspend between long keystrokes. So you realize you need hundreds of thousands of suspend/resume cycles on many machines (due to variation between hardware), to have a rock solid system, and even the testing starts taking a long time.

So then you get to chase the bugs that only happen after hundreds, or thousands, of suspend/resume cycles, not knowing if they are hardware or software…. Some more chips buried in the bag…

So over the last few days we’ve been chasing a problem in our power supply causing failures, and it looks like we have a bug fix in hand, now being tested (of course, doing the testing now takes days on a bunch of machines). Even before this fix, some systems were going more than 140,000 suspend/resume cycles before failure, but some machines were showing problems faster (yes Petunia, these are cycles while Linux is running, not just hardware tests). In a few days, maybe we’ll have some more interesting data, but it now takes quite a while to get the data.

It will be interesting to see how many more chips we have at the bottom of the bag. Not many, but I’ve learned to never say we’ve eaten the whole bag. This bag is sort of like Hermione’s Hogwarts bag, which is much, much bigger inside than outside. There may be a few rattling around in a corner someplace we haven’t found yet.

My great thanks to (in no particular order): Wad, Marcelo, Chris B, Javier, Jordan, Andres, Mitch, Arnold, Gary, Richard, Terry, Mike, Kim, Ronak, Victor, Jon, …. I know I have missed a few potato chip consumers, particularly inside Cozybit and Marvell; my apologies to them…

Busy, busy, busy….

September 4th, 2007

Getting the software ready for OLPC’s launch is an “interesting” job; my hair grows pointier by the day…

But it won’t stop there; this is just a beginning. The laptop.org web site has our job openings (you can help the remaining few hairs I have sharpen to a razor fine pitch with many of them).

In many ways, I think the challenges of deployment are more interesting than those of building the laptop. While our team has innovated in many ways, in the collaborative nature of Sugar, in our power management, and elsewhere, the challenges of deployment are much more varied and, I predict, will bring very interesting system challenges forward on our agenda. Scale, scale, and scale are our challenges.

WiFi worked; cell phones failed.

May 25th, 2007

I was recently stuck on an elevator for over an hour and a half.

Ironically, cell phones would not work (either mine, or the guy I was stuck with), but there were (at least) two open access points we could use to get help.

Turned out, there were two entrapments at the same time, and so they came and let the other elevator out, and then left. Good thing we were able to keep pestering people for help…..

Ah, those wonderful people in the press…

May 1st, 2007

Some reporters try hard to get it right, some don’t…

Here’s the straight scoop about the changes in the OLPC specification:

Something over 2 months ago, AMD visited and told us we could get the Geode LX-700 at roughly the price we could get the GX, in large part because a binning for the 700 would cause the yields on the parts to be very high. Our tests showed the LX to be roughly twice as fast as the GX, and sometimes more for interpreted languages due to its larger cache, and with a much better graphics engine: we were hurting on depth conversion in alpha blended graphics. Of course, we sure wish AMD had let us know this some months earlier, but such is life.

This then begged an interesting question. We had been planning a “Gen 1.5” for sometime next year, to upgrade the processor, RAM and flash memory.

At a launch countries meeting in early March we presented two options:

  1. Plan of record, Geode GX, 128 meg of RAM, 512 meg of flash
  2. Accelerating Gen 1.5, Geode LX, 256 meg of RAM, 1 gigabyte of flash, incurring about a one month delay in schedule to accommodate another spin of the logic board

The launch countries unanimously preferred, though it would cost slightly more initially, to have one nicer system over a longer period of time. Logistically, this makes their lives (and ours) much easier and we get to concentrate much more on Gen-2 to really get the cost down to where we want it. I was in the room when this decision was taken, and it had nothing to do with Microsoft. That we have always said “$100 in late 2008-2009” always seems to get lost in the press.

I’m also glad that Microsoft likes the SD card (which they need); note that that will cost much more than Windows, and that Linux doesn’t need it ;-). Again, the primary push for it came from two events, both taking place last summer, which had little to do with Microsoft:

  1. We found the Geode GX had a lousy NAND flash controller, so bad we had to build an ASIC to have a system with decent performance. Those of you who tried to use the ATest boards with the on board flash know how bad it was. Once we had an ASIC, the incremental cost of the SD slot (which the countries had been asking for) was just about the cost of an SD connector. This is also why a camera became feasible; it was the cost of the sensor; amazing what cell phones have done to camera prices :-). Alternatives to doing the CaFE’ ASIC cost just about as much as the ASIC did, but did not provide the additional functionality. Volume is a wonderful thing.
  2. Feedback from the launch countries was (correctly) that with the current drop in flash prices, having some way to augment the on-board flash cheaply would increase the useful lifetime of the system. They were right, and we had a way to accommodate them due to CaFE’.

My last topic is about freedom: “They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.” - Benjamin Franklin:

  • Some have suggested that we deliberately restrict what software can run on OLPC: this goes fundamentally against the grain of what freedom is.
  • We’ve made hardware available to many alternate operating systems, including Minix, ReactOS, Microsoft, etc. This is partly selfish: we’d like to find any hardware problems sooner rather than later, and different systems exercise machines differently; and partly because having an open platform is so essential. Just as 10 years ago we had no clue Linux would have the importance it has today, we don’t know what will be important in 10 years. Freedom, however, is eternal. You don’t get to choose if you mean what you say. We mean what we say.
  • Microsoft gets what we do with others in this situation, no more, no less: access to the information we have. On several occasions, our answer was: “Use the Source, Luke!” and a pointer to our repositories.

Lastly, I will note that free and open source software is one of the five tenets of OLPC, worked out jointly in discussion with the countries.

Mitch Bradley reports….

April 23rd, 2007

“Gary at Quanta built a little relay gadget that can activate the power button remotely, under software control from another XO. Last night he ran an overnight suspend/resume cycle test, at 1 second suspended, 1 second on. It ran for 14,000 (fourteen thousand) cycles without any failures.”

“Now we are trying with a longer off time, to catch possible “memory rot” problems, if they exist.”

Would that my HP laptop did even 1/1000th as well…. Of course, given it takes 11 seconds to resume at all, this is a non-starter….
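A cycle test of the kind Mitch describes can be sketched roughly as follows. This is only an illustration, not Gary's actual setup: `press_power` and `is_alive` are hypothetical callables standing in for the relay gadget and for whatever health check the controlling XO performs, and the 1 second defaults simply mirror the timings in the quote.

```python
import time
from typing import Callable


def run_cycle_test(press_power: Callable[[], None],
                   is_alive: Callable[[], bool],
                   cycles: int,
                   suspended_s: float = 1.0,
                   awake_s: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Return the number of cycles completed before the first failure."""
    for completed in range(cycles):
        press_power()        # hypothetical relay pulse: suspend the machine
        sleep(suspended_s)
        press_power()        # second pulse: wake it up again
        sleep(awake_s)
        if not is_alive():   # hypothetical health check after each resume
            return completed
    return cycles
```

At 1 second suspended and 1 second on, 14,000 cycles is 28,000 seconds of waiting, or a little under 8 hours, which fits an overnight run. Raising `suspended_s` gives the longer off time used to look for "memory rot".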

Down, down, down…

April 16th, 2007

Marcelo Tosatti told me at FISL our resume is at 160 milliseconds… :-)

Fast resume; beginning to come together….

April 8th, 2007

We have the fast resume path running on the OLPC BTest-2 hardware now. Mitch Bradley’s OFW starts the process, and pokes some of the slower hardware to start to wake up in advance. We don’t yet know if we’ll need any of the tricks that Mark Foster talked about at the Power Management Summit a year ago, though I suspect not. This is still early days. Almost everything is now resuming correctly except SD, and Pierre Ossman is looking into that. Marcelo Tosatti has USB resuming as well; he had to add a work queue to schedule some of the resume work later. The DCON is working; your system can have the screen on while fully suspended.

Chris Ball’s measurement of resume time is now 223 milliseconds. The best the hardware should be able to do is believed to be about 63ms (according to Jordan Crouse). Our next goal is to make resume absolutely reliable (since we intend to do it so much more than most people); we won’t work on speed until we reach that goal. Then we’ll go back to resume performance.

We then get a flurry of process stuff happening at user space, as hal/dbus still believe one of these sleeps should wake up random hardware (like dhcp’ing again on the wireless, which in our case has been up the entire time… Hmmm. That begs an interesting question around dhcp lease expiration we’ll have to think about.).

Contrast this to my HP laptop, which takes 11 seconds to resume from RAM, despite being a much faster processor.

FISL again this year…

April 6th, 2007

I’ll be flying south to Porto Alegre Monday night for the FISL conference; Wednesday should be interesting, as I’ll get a chance to see one of the first trial deployment schools that also happens to be there. At this hour, 6 days in advance, FISL has more than 4600 registrants. A year ago, with only industrial design models, we were pretty mobbed; I can’t imagine what it may be like this year. It’s a long flight from Boston to PA, but at least only 1 hour time difference (east of Boston, for those of you geographically challenged).

The point about millions of machines and bugs in the previous blog posting I made is that I need to find someone to worry about QA and testing in OLPC’s interesting environment at the crossroads of free software, governments, and somehow inspiring self organization of QA and testing in deployments all over the world. Knowledge of both conventional commercial and free software QA processes is highly desirable. If you feel up to the challenge, let me know.

Millions…

March 29th, 2007

Chris Ball came up with a good line for the OLPC project today:

“When you have a million machines, you start hitting one-in-a-million bugs (so we get 10).”

Keeps things in perspective….
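The arithmetic behind Chris's line can be sketched with a toy model. The assumption here (mine, not his) is that each machine independently hits a given bug with the same small probability; the function names are made up for the illustration.

```python
def expected_hits(machines: int, p_bug: float) -> float:
    """Expected number of machines that hit the bug."""
    return machines * p_bug


def p_at_least_one(machines: int, p_bug: float) -> float:
    """Probability that at least one machine hits the bug."""
    return 1.0 - (1.0 - p_bug) ** machines


for n in (1_000_000, 10_000_000):
    print(n, expected_hits(n, 1e-6), round(p_at_least_one(n, 1e-6), 3))
```

With a million machines and a one-in-a-million bug, you expect about one hit, and the chance of seeing it at least once is roughly 63%. At ten million machines the expected count is about ten and seeing the bug is all but certain.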