01 Feb 2012

Planet Twisted

Jonathan Lange: Simple made easy

Rich Hickey did a great talk at Strange Loop called "Simple Made Easy". You should watch it.

When I tried to explain the talk to someone, I stumbled a lot and it was obvious to me that I didn't really understand it. So I'm going through it again and turning it into a blog post, purely for my own gain.

This is roughly the first half of the talk. Not much of my own analysis or opinion is inserted, and I've pretty much stuck with Hickey's illustrations and phrasings. Thus this post is pretty derivative. Oops.

Simple vs Easy

"Simple" means one thing, "easy" another. Simple is the opposite of complex. A thing is simple if it has no interleaving, if it has one purpose, one concept, one dimension, one task. Being simple does not imply one instance or one operation: it's about interleaving, not cardinality. Importantly, this means that simplicity is objective.

Easy is the opposite of hard, or difficult. A thing is easy if it's near to hand, if it's easy to get at (location), if it's near to our understanding (familiarity) or skill set or if it's within our capabilities. This means that ease is relative.

Speaking English is dead easy for me, but that doesn't mean that speaking English is intrinsically simple. I find French quite difficult. Little French children speak French all the time, and there's always a part of me that thinks, "Boy, those kids are clever, being able to speak a foreign language at that age", but that's silly. It's easy for them, it lies near to them.

This distinction between simple & easy is a good one, and is useful in all sorts of areas. But how does it relate to software?

Constructs vs Artefacts

As programmers, when we make software we are working on constructs: source code, libraries, language concepts and so forth. Rich contends that we focus on the ease of use of those constructs: How many lines of code? How much boilerplate? Will new developers be familiar with our technology?

But all of this is secondary. What actually matters is the artefact, the running programs that users actually use. Does it do what it's supposed to do? Does it do it well? Can we rely on it working well? Can we fix problems when they occur? Can we change it? You know, the interesting problems.

Thus we need to be assessing our constructs - our code, our technology choices - based on the attributes of the artefacts that we'll create, not based on the experience of typing code in.

Limits

We can't make something reliable if we don't understand it. And, actually, everyone's understanding is pretty limited. We can all only hold a small number of things in our head at once.

When things are complex, many parts are tied together by definition. You can't pull out just one piece and consider it because it's intertwined with other pieces. This creates an extra burden to understanding a system and thus makes it difficult to reason about the system.

You do need to reason about a system, both to know what to change and to be able to do so without introducing defects. Tests, refactoring, rapid deployment and all that are great, but to make a change to the system safely & without fear still requires you to be able to reason about it. Every bug in your product that was found in the field passed the type checker and passed all of the tests. Your type system doesn't tell you what change to make next in order to get the software you want any more than guard rails on a highway tell you how to get to Grandma's.

Speed

Focusing on ease and ignoring simplicity means that you'll go really fast in the beginning, but will become slower and slower as the complexity builds.

Focusing on simplicity will mean that you'll go slower in the beginning, because you'll have to do some work to simplify the problem space, but making sure that you only have intrinsic complexity means that your rate of development will remain at a high constant.

There are no actual numbers for this.

Complicating constructs

Many complicating constructs are available, familiar, succinctly described and easy to use. But none of that matters to end users. What matters is the complexity they yield. This complexity is incidental, it's not intrinsic to the problem.

If we build things simply, then the resulting system is easier to debug, easier to change and easier to understand.

Compare a knitted castle to a castle made of Lego. The knitted castle might have been great fun to make, and might have been really easy if knitted using a loom and cutting-edge knitting tools, but there's no way that it's easier to change than a Lego castle. It's not about the ease of construction, it's about the simplicity of the artefact.

How can we make software easier?

Well, we can install it to make it easier by location. We can learn it and try it to make it easier by familiarity. We can't do much about our capabilities though. If we want software to be easier to comprehend, we are going to have to bring it down to our level. We have to make it simpler.


Take Lisp as an example. It's hard for many people because they don't have a Lisp installed, or their editor doesn't support paren matching, but they can make it easier by installing a Lisp and getting a plugin for their editor. It's also hard because it's unfamiliar. Who'd have thought that parens could go on that side of the function? But you can gain that familiarity quickly enough.

But parens in Lisp are used for functions and for grouping data. That's hard to get your head around, and that's because it's complex. It braids together two distinct notions.

01 Feb 2012 1:50pm GMT

30 Jan 2012

Jack Moffitt: The Numbers Behind the Twitter Data Silo

The dark future of search is being foreshadowed by this Twitter vs. Google fight. The latest Twitter volley at Google is this quote (seen on GigaOm) from Twitter CEO Dick Costolo:

"Google crawls us at a rate of 1300 hits per second... They've indexed 3 billion of our pages," Costolo said. "They have all the data they need."

There's no doubt that 1,300 hits per second is a large number, but let's put that in perspective:

For part of 2010, Google was perhaps able to keep up with the stream at 1,300 requests per second. Somewhere between February and June, the average volume of tweets outpaced them.

Let's assume that they kept pace until June 2011, and that on June 1, Twitter went from somewhere in the range of 1,300 tweets per second to their reported 2,300 tweets per second. Google is 1,000 tweets behind per second.

By the end of the year, Google missed 15.5 billion tweets. They are two months behind if they didn't skip any, and the tweet volume did not increase. But it did increase by 25% or so by October, and surely it has grown more since then.

If Google has only indexed 3 billion pages so far, they have approximately 12 days of tweets at current volume. It's pretty hard to rationalize the 3 billion pages number against the 1,300 per second number. Was Google indexing at a much slower rate before? Did they not start until a few months ago?
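The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch under assumptions that are not spelled out in the post: the backlog period is taken as 180 days (six 30-day months, which is what reproduces the 15.5 billion figure), and "current volume" is taken as 2,300 tweets per second plus the 25% growth mentioned above. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def missed_tweets(tweets_per_second, crawls_per_second, days):
    """Tweets left uncrawled when the crawl rate trails the tweet rate."""
    shortfall = max(tweets_per_second - crawls_per_second, 0)
    return shortfall * SECONDS_PER_DAY * days


def days_of_tweets(indexed_pages, tweets_per_second):
    """How many days of tweets a given number of indexed pages covers."""
    return indexed_pages / (tweets_per_second * SECONDS_PER_DAY)


# Assumption: 180 days of falling 1,000 tweets per second behind.
backlog = missed_tweets(2300, 1300, days=180)

# Assumption: current volume is 2,300 per second plus 25% growth.
coverage = days_of_tweets(3_000_000_000, 2300 * 1.25)

print(f"{backlog / 1e9:.2f} billion tweets missed")
print(f"{coverage:.1f} days of tweets indexed")
```

Under those assumptions the results land close to the 15.5 billion tweets and roughly 12 days quoted above; different choices for the period or the growth rate shift the numbers but not the conclusion.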

Of course Google may be getting multiple tweets per request, perhaps by crawling the timelines of important users. But this means that they probably get a lot of requests that don't give them any new tweets, or else the timeliness of the data is poor.

No matter how you slice it, it appears Google would be unable to keep up. Even if they were keeping up now, Twitter's growth probably sets a time limit for which keeping up remains possible.

Perhaps Google is super clever, and can index only the right tweets. I think that it's more probable they have "enough" data to surface results for the super popular topics, and miss nearly everything in the long tail of the distribution. I expect that this adversely affects search quality, which one suspects is a high priority for the world's best search engine.

Google is no saint. They are just as guilty of the same data hoarding. If you ran these numbers for YouTube indexing, I think you will find the situation is much worse. I imagine that most of these data silo companies purposefully set their crawl rates too low for anyone to achieve high quality search results.

In the case of Twitter, the end result for users is even worse because Twitter's own attempts at search are terrible and are getting worse over time. At least Google makes a decent YouTube search, even if no one else can.

Even if Google could get all the tweets, they still would have very little to no Facebook data. I still think the best strategy in this situation for them is to create their own social data and use that instead. It's a tough road, but they seem to have little choice.

In the end, it's not about Google or Twitter or Facebook, but the stifling of innovation and competition around data. We can only hope that some federated solution or some data-liberal company wins out in the end.

30 Jan 2012 8:47pm GMT

26 Jan 2012

Moshe Zadka: Why programmers are concerned about copyright law [Addendum]

As we saw in the last edition, programmers are concerned about copyright law because the only way to universally enforce copyright law would be to take away all universal Turing machines and make sure that we cannot control them. How bad would that be?

The previous episodes focused on laws of math (generality of computation) and physics (quantum mechanics makes it easy to build general purpose computers). Now we concentrate on laws of economics in the industrial age. In the beginning of the industrial age, factories lowered the cost of goods dramatically from before - a textile factory is much more efficient than a tailor making suits, and so it is much cheaper to clothe ourselves. However, as factories evolved, it became more and more important to drive the cost even lower - whereas before the competition was with other tailors, now the competition was with other factories. So, economies of scale and efficient production lines developed. Soon after that, supply chain management and moving factories to the most cost-efficient places were developed.

Let's say that you want to make T-shirts with political slogans. You've got your slogan writers, huddled and coming up with good slogans. They're a sunk cost - you'll pay them the same no matter how many T-shirts you make, or how much they cost. Now you build a big factory to make T-shirts, and print slogans on them. The company next door specializes in T-shirts with band logos. They build a big factory to make T-shirts and print logos on them. Someone realizes there is money to be made in this scenario - build a factory to manufacture T-shirts, and send them in large crates to places that will print stuff on them. Because their factory is bigger, their costs for making T-shirts are lower, and they pass some of the savings on to you, the political slogan company. In turn, you pass some of the savings on to your customers, and everyone is happier.

Lesson: When economies of scale hit, you want to try and buy standard components for the product you manufacture.

Now suppose you build a car. The anti-lock brake system needs to figure out when the car is sliding, and start "pumping" the brake. You can build a system that will contain gyroscopes that attach to some handle that pulls a wire that pumps the brake. Or, you can

Note that in this scenario, we source some standard parts. Even though the parts are more complicated, it turns out that we save a lot of money by using standard chips and accelerometers. Note that putting the software on the chip is just copying it to some standard storage device - extremely cheap. ABS brakes become cheaper, and people die less. Overall, a good thing.

The same sort of logic causes a lot of things that used to be done with handles, levers and pulleys to be done with a general purpose chip and some software. Your car is full of them. Your microwave has some. Your television has some. Each of those devices is basically a computer connected to some strange peripherals (brakes, microwave-rays or screens). In a world with strong copyright regulations, those devices' software is locked to us, even though we own the device. Where darkness goes, evil deeds covered by the darkness soon follow. The police in Evil Regime Country want to make sure nobody can run from them. They mandate that all cars sold in ERC must have, in their software, a special switch: when a certain Bluetooth signal is sent (many cars nowadays have computers connected to Bluetooth receivers), the software controlling the automatic gear system makes sure the car will not go over second gear. If the ERC police are smart about using this signal, nobody will ever know.

In a world with strong copyright protection, big corporations control your life - and they are corruptible, and if darkness holds, corrupt.


26 Jan 2012 5:48pm GMT

Thomas Vander Stichele: GStreamer 0.11 Application Porting Hackfest

I'm in the quiet town of Malaga these three days to attend the GStreamer hackfest. The goal is to port applications over to the 0.11 API, which will eventually be 1.0. There are about 18 people here, which is a good number for a hackfest.

The goal for me is to figure out everything that needs to be done to have Flumotion working with GStreamer 0.11. It looks like there is more work than expected, since some of the things we rely on haven't been ported successfully.

Luckily back in the day we spent quite a bit of time to layer parts as best as possible so they don't depend too much on each other. Essentially, Flumotion adds a layer on top of GStreamer where GStreamer pipelines can be run in different processes and on different machines, and be connected to each other over the network. To that end, the essential communication between elements is abstracted and wrapped inside a data protocol, so that raw bytes can be transferred from one process to another, and the other end ends up receiving those same GStreamer buffers and events.

First up, there is the GStreamer Data protocol. Its job is to serialize buffers and events into a byte stream.

Second, there is the concept of streamheaders (which is related to the DELTA_UNIT flag in GStreamer). These are buffers that always need to be sent at the beginning of a new stream to be able to interpret the buffers coming after it. In 0.10, that meant that at least a GDP version of the caps needed to be in the streamheader (because the other side cannot interpret a running stream without its caps), and in more recent versions a new-segment event. These streamheaders are analogous to the new sticky event concept in 0.11 - some events, like CAPS and TAG and SEGMENT are now sticky to the pad, which means that a new element connected to that pad will always see those events to make sense of the new data it's getting.

Third, the actual network communication is done using the multifdsink element (and an fdsrc element on the other side). This element just receives incoming buffers, keeps them on a global buffer list, and sends all of them to the various clients added to it by file descriptor. It understands about streamheaders, and makes sure clients get the right ones for wherever they end up in the buffer list. It manages the buffers, the speed of clients, the bursting behaviour, … It doesn't require GDP at all to work - Flumotion uses this element to stream Ogg, mp3, asf, flv, webm, … to the outside world. But to send GStreamer buffers, it's as simple as adding a gdppay before multifdsink, and a gdpdepay after fdsrc. Also, at the same level, there are tcpserversink/tcpclientsrc and tcpclientsink/tcpserversrc elements that do the same thing over a simple TCP connection.

Fourth, there is an interface between multifdsink/fdsrc and Python. We let Twisted set up the connections, and then steal the file descriptor and hand those off to multifdsink and fdsrc. This makes it very easy to set up all sorts of connections (like, say, in SSL, or just pipes) and do things to them before streaming (like, for example, authentication). But by passing the actual file descriptor, we don't lose any performance - the low-level streaming is still done completely in C. This is a general design principle of Flumotion: use Python and Twisted for setup, teardown, and changes to the system, and where we need a lot of functionality and can sacrifice performance; but use C and GStreamer for the lower-level processor-intensive stuff, the things that happen in steady state, processing the signal.
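The streamheader idea is easier to see in a toy model. The class below is plain Python, not GStreamer or Flumotion code, and every name in it is hypothetical. It only mimics the behaviour described above: a client that joins late is first handed the streamheaders, and then receives whatever buffers arrive after it joined.

```python
class FanOutSink:
    """Toy stand-in for the fan-out behaviour described above.

    This is an illustration only; it does not use or reproduce
    the multifdsink API.
    """

    def __init__(self):
        self.streamheaders = []
        self.clients = {}  # client name -> list of buffers delivered so far

    def set_streamheaders(self, headers):
        # Buffers every client needs before it can interpret the stream.
        self.streamheaders = list(headers)

    def add_client(self, name):
        # A new client always starts with the current streamheaders.
        self.clients[name] = list(self.streamheaders)

    def push(self, buffer):
        # Deliver an incoming buffer to every connected client.
        for delivered in self.clients.values():
            delivered.append(buffer)


sink = FanOutSink()
sink.set_streamheaders([b"caps", b"segment"])
sink.add_client("early")
sink.push(b"buffer-1")
sink.add_client("late")
sink.push(b"buffer-2")
print(sink.clients["late"])
```

The real element also manages client speed, bursting and where in the buffer list a client starts, none of which this sketch attempts.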

So, there is work to do in GStreamer 0.11:

So, it looks like there is a lot of work to be done. Luckily Andoni arrived today too, so we can share some work.

After discussing with Wim, Tim, and Sebastien, my plan is:

  1. create a common base class for multihandlesink, and refactor multisocketsink and multifdsink as subclasses of it
  2. create g_value_transform functions to bytestreams for basic objects like Buffers and Events
  3. use these transform functions as the basis for a new version of GDP, which we'll make typefindable this time around
  4. support sticky events
  5. ignore metadata for now, as it is not mandatory; although in the future we could let gdppay decide which metadata it wants to serialize, so the application can request to do so
  6. try multisocketsink as a transport for inside Flumotion and/or for the streaming components.
  7. In the latter case, do some stress testing - on our platform, we have pipelines with multifdsink running for months on end without crashing or leaking, sometimes going up to 10000 connections open.
  8. Make twisted reactors
  9. prototype flumotion-launch with 0.11 code by using gir

That's probably not going to be finished over this week, but it's a good start. Last night I started by fixing the unit tests for multifdsink, and now I started refactoring multisocketsink and multifdsink with that. I'll first try and make unit tests for multisocketsink though, to verify that I'm refactoring properly.

26 Jan 2012 10:16am GMT

23 Jan 2012

Jonathan Lange: Undistract me

Here's a thing that happens a lot to me: I'm doing some work, and as part of that work I need to run a command in my terminal that takes a little while. I run the command, look at it for about a second and then switch to doing something else - checking email, perhaps. I get deeply involved in my email checking, and then about twenty minutes later I switch back to the terminal and see the command has finished. For all I know, it finished nineteen minutes ago, and I was just too engrossed to notice it.

This is a big productivity sink for me, especially if the command happened to fail and need retrying. I'm not disciplined enough to just sit and watch the command, and I'm not prescient enough to add something to each invocation telling me when a command is done. What I want is something that alerts me whenever long running commands finish.

Well, that thing now exists, thanks to glyph's script that provides precmd and postcmd support to bash and a lot of help from Chris Jones of Terminator.

To use it right now:
$ bzr co lp:~jml/+junk/shell-tools
$ . shell-tools/long-running.bash
$ notify_when_long_running_commands_finish_install


You'll see that if you run a command that takes over 30 seconds to complete, it will pop up a notification, which should hopefully take you away from whatever it is you are doing and back to the task at hand.

If you look at the code, you'll see that it installs two hooks: precmd and preexec. preexec runs just before the shell launches a command, and precmd runs just before it prompts for the next command. Our preexec stores when the command was launched and the precmd checks to see if it finished within a certain time frame. If not, it sends out a notification.
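The interplay of the two hooks can be modelled outside the shell. The following is a Python sketch of the timing logic only, not the bash script itself; the class name, the injectable clock and the notify callback are all assumptions made for illustration.

```python
import time


class LongCommandNotifier:
    """Models the preexec/precmd pair: time a command, notify if it was slow."""

    def __init__(self, threshold_seconds=30, clock=time.monotonic, notify=print):
        self.threshold_seconds = threshold_seconds
        self.clock = clock      # injectable so the logic can be tested
        self.notify = notify    # stands in for the desktop notification
        self.command = None
        self.started_at = None

    def preexec(self, command):
        # Runs just before a command is launched: remember when it started.
        self.command = command
        self.started_at = self.clock()

    def precmd(self):
        # Runs just before the next prompt: check how long the command took.
        if self.started_at is None:
            return  # no command has run since the last prompt
        elapsed = self.clock() - self.started_at
        command, self.command, self.started_at = self.command, None, None
        if elapsed > self.threshold_seconds:
            self.notify(f"'{command}' finished after {elapsed:.0f}s")
```

Making the clock a parameter is purely a convenience for trying the logic out without waiting 30 seconds.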

Currently, you'll get a notification when you finish reading a long document, since the command finishes a long time after the command starts. Obviously this isn't ideal. I think the fix is to only send notifications when the shell doesn't have focus. Unfortunately, that's a little tricky and I think is going to be highly terminal specific.

Anyway, I'm a total shell newbie, so I'd love to know if there's any way this could be done better. Also let me know if you find this useful, or you know of someone who has already done this.

23 Jan 2012 5:44pm GMT

21 Jan 2012

Jp Calderone: Cleaning Up Branch Checkouts

Since Twisted development typically involves at least one branch per ticket, a Twisted developer can end up with a lot of branches checked out. For example, this morning I had 177 Twisted branches checked out on my laptop. Many of these were branches that I contributed code to, and perhaps even merged into trunk myself when they were complete. I could probably have deleted them at that point, but I usually can't be bothered. Besides, I put everything I have into the branch itself; by the time I'm merging it I'm done. Other branches are ones I've done code reviews on for other developers. I don't keep track of when these get merged into trunk as closely, since typically someone else is going to do those merges.

The incremental cost of another Twisted branch is pretty minimal. A few more megs used on my hard drive is barely noticeable. The aggregate cost can get pretty high though (seven GB for the 177 branches I had this morning). At some point this can cause problems.

Not all of these branches have been merged into trunk, either, or I could just wipe them all out with ease. And while I try never to leave uncommitted changes in a branch checkout, nobody's perfect... What I really want to do is just get rid of the branches that just aren't relevant anymore.

So I use cleanup-local.py to deal with the mess. It looks at my branch checkouts, talks to the Twisted issue tracker to learn the state of the associated ticket (due to the naming convention for Twisted branches, it is easy to determine which ticket is associated with a branch, given just the branch name). Then it deletes all the checkouts associated with closed tickets (due to the Twisted workflow, if a ticket is closed, it is a very safe bet that you won't need its branch anymore).

The net result is that in (far) less time than it took to write this post, my laptop went from having 177 Twisted branches to having just 34. To save even more time, I could probably set this up as a weekly cron job or something similar. It's easy enough to run now, though, that I just do so manually once every couple of months to keep things tidy.

Here's a brief snippet from today's run:

Found password-comparison-4536-2 for ticket(s): 4536
Status of 4536 is assigned
Found pb-chat-example-4459 for ticket(s): 4459
Status of 4459 is closed
Removing closed: pb-chat-example-4459
Found plugin-cache-2409 for ticket(s): 2409
Status of 2409 is closed
Removing closed: plugin-cache-2409
Found poll-default-2234-2 for ticket(s): 2234
Status of 2234 is closed
Removing closed: poll-default-2234-2
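The selection step can be sketched in a few lines of Python. This is not the code of cleanup-local.py. It assumes, based only on the branch names in the run above, that the ticket number is the last or second-to-last dash-separated number in the name, and it takes the ticket status lookup as a plain callable rather than talking to the issue tracker. All names here are hypothetical.

```python
import re

# Assumed convention: <description>-<ticket>[-<attempt>], as in the run above.
TICKET_SUFFIX = re.compile(r"-(\d+)(?:-\d+)?$")


def ticket_for(branch_name):
    """Return the ticket number encoded in a branch name, or None."""
    match = TICKET_SUFFIX.search(branch_name)
    return int(match.group(1)) if match else None


def closed_branches(branch_names, status_of):
    """Pick the branches whose ticket is closed according to status_of."""
    selected = []
    for name in branch_names:
        ticket = ticket_for(name)
        if ticket is not None and status_of(ticket) == "closed":
            selected.append(name)
    return selected


# Statuses copied from the sample run; a real tool would query the tracker.
statuses = {4536: "assigned", 4459: "closed", 2409: "closed", 2234: "closed"}
branches = [
    "password-comparison-4536-2",
    "pb-chat-example-4459",
    "plugin-cache-2409",
    "poll-default-2234-2",
]
print(closed_branches(branches, statuses.get))
```

A branch whose description itself ends in a number would confuse this pattern, so treat it as a starting point rather than a faithful parser.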


21 Jan 2012 7:29pm GMT

20 Jan 2012

Moshe Zadka: Why programmers are concerned about copyright law [Part 1 of 2]

Welcome to a new experiment. I am going to try and explain why programmers tend to be concerned (one way or another) about copyright law, to the level where we compare it to slavery or tyranny. This is not going to be easy, since I intend this to be readable by people who are not programmers, and who never programmed. I am going to start from the very beginning, and it's going to take a while.

Before we even start, I would like to point out an excellent description for why we usually expect explanations to be simpler than they really are. Please do read this. I'll wait, really. If you think you do not need to read it, that's actually evidence that you do - you expect my explanation to be simpler than it is, and that you do not need that bit of background knowledge…

Welcome back! One important concept introduced in the article linked above is "Word of Power". (If you didn't read it, now is the time to fix that issue…) I will try to introduce the new concepts using Words of Power, saying the word, and then linking it to the power behind it.

The first Word of Power will be "Universal Turing Machine". You may remember a previous post of mine about Alan Turing, one of the greatest giants on whose shoulders we have the privilege to be standing. But I want to start by talking about another great giant, John McCarthy. McCarthy wrote a paper with the somewhat unassuming name, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I. A more appropriate title would have been "paving the way forward for programming for the next 50 years", you see. What McCarthy did in the paper was to lay down a simple explanation of a system for defining computation. Nowadays, a lot of computation is done, appropriately, on electronic computers. Before we had those, however, "computer" was a job title of a person who applied computation rules to symbols on paper to get results - much like the rules we learned in primary school for doing long multiplication.

In both senses of the word "computer", the goal is the same: apply rules to manipulate symbols. McCarthy defined a specific set of rules to manipulate symbols that he called "Lisp". Lisp used S-expressions (short for "symbolic expressions") to define computations. Read this sentence over again, because it might be the most important sentence in here: even though Lisp was a set of rules to manipulate S-expressions, those S-expressions define other computations. If you only learned to manipulate S-expressions, someone could write an S-expression that did, say, long multiplication. Then, if you manipulated the S-expression for long multiplication followed by S-expressions representing numbers, you would do long multiplication. Now comes the fun part - you could even write the rules for manipulating S-expressions using S-expressions. You might think that for this to work, the rules for manipulating S-expressions would be really complicated. This is not true - you can see the manipulation rules here. Sure, it's a bit long, but doesn't look like more than you needed to learn to do arithmetic, right?

Well, what use would rules for manipulating S-expressions be when we already know how to manipulate S-expressions? Maybe very little, if not for a few other things. I hope you have seen a demonstration of the Game of Life, once upon a time. If you haven't, I recommend that you read the Wikipedia article. With 4 simple rules, much much simpler than McCarthy's S-expression rules, much is possible. How much? Well, the proof is long and difficult, but mathematicians have found a Game of Life configuration that will manipulate S-expressions. So if you only knew to manipulate squares on graph paper according to the rules of the Game of Life, seemingly easier than manipulating S-expressions, we could just give you a (pretty hefty) Game of Life that would cause you to manipulate S-expressions. Wait, but manipulating S-expressions lets you do arithmetic, right? So you could do arithmetic too!
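To give a feel for how little the Game of Life asks of whoever is carrying it out, here is a minimal Python sketch of one generation, using the rules as they are usually stated: a live cell with two or three live neighbours survives, a dead cell with exactly three live neighbours comes alive, and every other cell is dead in the next generation. The representation (a set of live cell coordinates) and the function name are choices made for this illustration.

```python
from collections import Counter


def step(live):
    """Advance a set of live (x, y) cells by one generation."""
    # Count, for every cell next to a live cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly three neighbours; survival on two or three.
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live)
    }


# A "blinker": three cells in a row that flip between horizontal and vertical.
blinker = {(0, 1), (1, 1), (2, 1)}
print(sorted(step(blinker)))
```

That is the whole rule set; everything else, including the S-expression manipulating configuration mentioned above, is a matter of choosing the starting cells.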

Wait, what about arithmetic? Well it turns out, if you just know how to add 1 to numbers, and do a bunch of other trivial things (like substitute numbers for other numbers), we can give you instructions that will let you manipulate the Game of Life. Or other instructions, that will manipulate S-expressions. It turns out that if you can do interesting enough computations, it doesn't matter a whole bunch what you know to do - you can find an initial input (S-expression, Game of Life configuration, or arithmetic instructions) that will let you do anything. Anything? Well, not quite!

Another interesting result (called "Rice's Theorem") is that you cannot write an S-expression (or any of the other things) to take any S-expression (or any of the other things) and say anything interesting about what it does to any input. Notice that I didn't say "respectively", and didn't mean to - you can't make an S-expression that will take any Game of Life configuration, and will say anything interesting about what it does. You can write S-expressions that will take some S-expressions (or …) and say something interesting about them - but not any S-expression (or …).

What has all this to do with our friend, Alan Turing? Well, Turing invented the original thing that stands for "…", and he called it a Turing machine. A Turing machine has an internal state, and reads from and writes to a tape. The interesting thing about a Turing machine? You can write a Turing machine that will manipulate S-expressions. Or play the Game of Life. Or perform arithmetic. Any of those Turing machines can do anything S-expressions do! All of those are called "Universal Turing Machines", because any computation possible by any other Turing machine, or S-expression, or Game of Life, or arithmetic, can also be done by them. You only need one Turing machine, or one S-expression, or one Game of Life configuration, or one set of arithmetic rules and you can do anything anyone else can do, if maybe somewhat slowly. Fortunately, or unfortunately, all those computations also have the same basic flaw - you can never write one of those to say anything interesting about any computation that comes their way (this will be important later!)

Let us recap:

  1. Computations that manipulate symbols can be done either electronically or by a human, equally well, if not equally fast.
  2. You can define a "universal computation", that if you learn to do, anyone can get you to do any other computation by writing the correct input.
  3. It is impossible for you to know anything interesting about any possible computational input.

Those three concepts ("a lot of interesting manipulations are equivalent", "universal manipulations are possible" and "nothing interesting to say") are the core of what computers are. A computer chip does very specific manipulations ([1]). However, programmers, by writing carefully thought out inputs, can get it to perform any computation ([2]). Lastly, it is impossible to write a computer chip that will be able to say anything interesting about any specific program ([3]). Of course, crafting these programs is difficult - this is why programmers use "programming languages", which are even more "universal computation instructions". A language is called "Turing complete" when you can use it to build a Turing machine (which would also have allowed you to build an S-expression manipulator, or Game of Life player, etc.). In short, all we have said so far applies equally well to most programming languages. There are niche non-Turing-complete languages, used where being able to say something interesting about any possible program is important - but here's the rub - it's actually hard to invent something which is not Turing complete. As we saw above, even very simple things will be able to build Turing machines.

Computers, languages and programmers are all becoming better at making computations that were "possible" a decade ago be "fast" now. There is an enormous economic pressure on that - after all, the more you can do "fast", the more you can do "more and bigger", and people like "more and bigger".

Now, we come to a very simple computation - copying. A "copier" can be defined in multiple ways, but let's suppose, for the sake of argument, that we just want to replicate the input twice ("have two copies"). It is easy to write an S-expression to do it. Therefore, any Turing complete environment can do it. But remember - it is impossible to say anything "interesting" about any possible S-expression, and therefore, it is impossible to write a computation that outputs the correct answer for "this S-expression is not a copier". Wait, what? Yes, that's right. If an S-expression has access to some input, it is impossible to know for sure that it will not copy it.

The next Word of Power I wish to introduce is "bits". A "bit" (short for binary digit) is a place-holder for something that can be either 0 or 1. Anything that can represent at least two states (say, the light switch in your kitchen) is a bit. Now, let's say that you have a set of symbols - say musical notes. I hope you have seen "The Sound of Music", and remember the best song ever - "Do, a deer" (etc.). Our symbols are "Do, Re, Mi, Fa, Sol, La, Ti". As the song makes the case, we can write any piece of music with just those (yes, I know about sharps, flats and octaves…bear with me). Now, let's assign "bit patterns" to each note:

Now, if we want to write the notes for the first line of "Twinkle, Twinkle" (Do Do Sol Sol Ti Ti Sol) we can instead write 000000100100110110100 in bits. If we have those bits, we can just take them in threes, and convert them back to the musical notes. We can do the same with any set of symbols - say, the alphabet. This is actually how, more or less, computers store text - they can convert the alphabet into bits, and save those bits. Then they can manipulate those bits using carefully crafted programs to, say, replace the word "Foo" with "Bar". Or, say, copy them. If we have the bits to a fan fiction of Mickey Mouse having sex with Pluto, we can copy those too. If instead, we have the bits for an S-expression that will generate the bits for a fan fiction of Mickey Mouse having sex with Pluto, we could copy those too. But wait, this is funny - there is no way to know for certain that certain bits are not an S-expression that will produce the bits of Mickey Mouse having sex with Pluto.
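The encoding can be sketched in a few lines of Python. The bit string above pins down Do = 000, Sol = 100 and Ti = 110; the patterns for the remaining notes are an assumption (plain counting order), and the function names are made up for this illustration.

```python
# Assumed assignment: notes numbered in song order, written as 3-bit patterns.
# Only Do, Sol and Ti are confirmed by the bit string in the text.
NOTES = ["Do", "Re", "Mi", "Fa", "Sol", "La", "Ti"]
NOTE_TO_BITS = {note: format(index, "03b") for index, note in enumerate(NOTES)}
BITS_TO_NOTE = {bits: note for note, bits in NOTE_TO_BITS.items()}


def encode(notes):
    """Turn a list of note names into one string of 0s and 1s."""
    return "".join(NOTE_TO_BITS[note] for note in notes)


def decode(bits):
    """Take the bits in threes and convert each group back to a note."""
    groups = [bits[i:i + 3] for i in range(0, len(bits), 3)]
    return [BITS_TO_NOTE[group] for group in groups]


melody = ["Do", "Do", "Sol", "Sol", "Ti", "Ti", "Sol"]
encoded = encode(melody)
print(encoded)           # 000000100100110110100
print(decode(encoded))   # the original list of notes
```

Nothing here is specific to music: swap the list of notes for the alphabet and the same two functions store text.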

A fan fiction of Mickey Mouse having sex with Pluto is a dangerous thing. It is, as the law and case law currently stand, a violation of copyright law. In general, it is not protected under the Fair Use doctrine, and this means that writing this fan fiction is illegal. Copying this fan fiction is illegal. What's more, Disney has an incentive (or at least, believes it has an incentive) to prevent copying this fan fiction around. When incentive (or perceived incentive) and legal powers combine, the result is expected - Disney would dearly love to have an automatic way to prevent computers from copying this fan fiction. Or, say, from copying the bits that represent the video of "Cars 2".

Remember what I said above - it's impossible to have a way to know for certain which bits are actually an S-expression that will create the bits for the "Cars 2" video. Although Disney has the incentive (this is a matter of economics and psychology, ultimately the results of the forces of evolution) and the legal powers (this is a matter of social convention), the math, hard and unyielding, doesn't care. Math doesn't care about evolution. Math doesn't care about society. Math is math, and the math says that you can't build a computer that will only copy things if they're not S-expressions that produce the Cars 2 video, no matter how much you want to.

Join me next episode, when I explain the basics of cryptography, and how they pertain to the issue of copyright law.


20 Jan 2012 5:37pm GMT

Moshe Zadka: Why programmers are concerned about copyright law [Part 2 of 2]

[A lot of the ideas in this edition owe a lot to Cory Doctorow. The responsibility for any mistakes or omissions is still mine.]

Recapping the previous episode:

In this episode, I will have a lot more to say about bits. Before I delve into bits, though, I want to talk about computers. While abstract systems that do computations abound, most modern computers are fairly similar. The theoretical computational system they are built to resemble is the so-called "Von Neumann machine". A modern computer has lots of peripherals, but strip away the peripherals and you are left with:

  1. A "microchip" which implements instructions like "Add Register1 and Register2 and put the Results in Register3"
  2. Memory - mapping of "addresses" (index numbers) to "values"
  3. Instructions on the microchip of the sort "Treat register 1 as an address, and fetch the value there into register 2"
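To make the three items concrete, here is a toy sketch of such a machine. Everything in it is hypothetical: the opcode names (`ADD`, `LOAD`, `SET`) and the tuple layout are invented for the illustration, and, unlike a real Von Neumann machine, the program is kept in a separate list rather than in the same memory as the data.

```python
# Hypothetical three-instruction machine; the instruction names are made up.
def run(program, memory, registers):
    """Execute a list of (opcode, a, b, c) tuples against registers and memory."""
    for opcode, a, b, c in program:
        if opcode == "ADD":        # registers[c] = registers[a] + registers[b]
            registers[c] = registers[a] + registers[b]
        elif opcode == "LOAD":     # treat registers[a] as an address, fetch into registers[b]
            registers[b] = memory[registers[a]]
        elif opcode == "SET":      # put the constant a into registers[b]
            registers[b] = a
        else:
            raise ValueError("unknown opcode: %r" % (opcode,))
    return registers


memory = {100: 7, 101: 35}       # addresses mapped to values
program = [
    ("SET", 100, 1, None),       # register 1 = address 100
    ("LOAD", 1, 2, None),        # register 2 = memory[100]
    ("SET", 101, 1, None),       # register 1 = address 101
    ("LOAD", 1, 3, None),        # register 3 = memory[101]
    ("ADD", 2, 3, 4),            # register 4 = register 2 + register 3
]
registers = run(program, memory, {})
print(registers[4])              # 42
```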

The exact instruction set that the microchip implements depends on the type, but ultimately, as we saw before, it does not matter too much - all computational systems are equivalent. What is important is "Moore's Law", which exists in many variants but ultimately says that:

  1. Microchips can perform more calculations per second every year.
  2. The amount of memory available for a given price keeps growing.

An important computer peripheral, which almost all computers have a variant of, is a storage device. That ranges from a magnetic hard drive to a micro-SD card. All that is important, for our purposes, about those is:

  1. Accessing them is slower than memory
  2. They are bigger (have more addresses) than memory for the same price [a lot more -- frequently 10x or 30x]
  3. They grow bigger for the same price every year

Another important computer peripheral is the network card. Networks connect computers to each other.

Now, let's remember something we said earlier - there is no way to reliably detect any sequence of 0s and 1s that encodes a computation that we do not like - say, a Turing machine outputting a fan fiction about Mickey Mouse having sex with Pluto. Therefore, if a network card allows you to send two distinct messages [if it only allows you to send one message, it's kind of a sucky network card], you can send some "illegal" Turing machine encoding. Moreover, you will be able to store this illegal Turing machine faster and more cheaply every year.

So as an inescapable conclusion of (1) the math and (2) Moore's law, we see that the law cannot be enforced, and it costs less and less every year to evade this law. The laws of the universe (such as how easy it is to use Quantum Mechanics to implement computation on the stuff that is abundant on every ocean beach, or Rice's theorem) care nothing for Walt Disney or a starving artist - they are simply there, immutable and unforgiving, and humans must learn to deal with them.

One way to deal with them would be to "follow the money": enforce the law only when someone is breaking it for corporate-level gain. This is not the way the legal system has gone. Instead, they turned to God - "Crypto. Really good, standards-defined crypto".

So to understand what transpired, it is important to understand the basics of cryptography. From the beginning of time (as humans count time, I guess), people have communicated with each other - "Look, Ug, I found an antelope and killed it. Help me eat it?" Not long afterwards, eavesdropping began - "Hey, everyone, Ung has an antelope." The next level was to use codes - "Look, Ug, I found a You-know-what and you-know-what-ted it. Help me you-know-what it?", and so the battle began.

Codes, like in the example above, have to balance two issues. The person for whom they are intended must be able to decipher them (Ung better hope Ug will get the right message). The person for whom they are not intended must not be able to decipher them (or once again, the whole village will know about the antelope). Fast-forward to Roman times: Caesar had the eponymous cipher, based on shifting every letter in the alphabet by a certain amount. Much later, the Nazis had Enigma. Word to the wise: use neither of those, as they both have been "broken". "Broken" is a term cryptographers use to say "the eavesdropper can read the messages which are not intended for them". Cryptographers spend a lot of time trying to figure out how to read those…
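The shifting cipher, and why it counts as "broken", fits in a short sketch. The function names are made up, and the sketch only handles lowercase ASCII letters, which is an assumption for brevity.

```python
import string

ALPHABET = string.ascii_lowercase


def caesar(text, shift):
    """Shift every lowercase letter by `shift` places, wrapping around at 'z'."""
    offset = shift % 26
    shifted = ALPHABET[offset:] + ALPHABET[:offset]
    return text.translate(str.maketrans(ALPHABET, shifted))


def all_decryptions(ciphertext):
    """An eavesdropper has only 26 shifts to try, so trying them all is trivial."""
    return [caesar(ciphertext, -shift) for shift in range(26)]


secret = caesar("help me eat the antelope", 3)
print(secret)                    # khos ph hdw wkh dqwhorsh
candidates = all_decryptions(secret)
print(candidates[3])             # help me eat the antelope
```

The eavesdropper does not need to know the shift: one of the 26 candidates will be readable, and picking it out takes a glance.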

But back to the basics: In computer-based cryptography, we have a secret, S. S can be considered, like everything else a computer handles, as a certain sequence of bits. For cryptography to work, two parties have to pre-agree on S. (Side note: I will not be covering public-key cryptography here.) Then, we must be able to compute the "encryption": a computation that takes M, a message, and calculates a function that depends on M and S. Then we must be able to compute "decryption": a function that takes the encrypted message, E, and S, and returns to us M. Next, it should be impossible (or at least, very hard), to compute M from E without S.
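The shape of that contract (encrypt with M and S, decrypt with E and S) can be shown with a toy. This is a sketch only: repeating-key XOR is emphatically not a secure cipher, and it is used here purely because it is short and is its own inverse. The names `encrypt` and `decrypt` and the sample values are assumptions for the illustration.

```python
from itertools import cycle


def encrypt(message, secret):
    """Toy cipher: XOR each message byte with the repeating secret. NOT secure."""
    return bytes(m ^ s for m, s in zip(message, cycle(secret)))


def decrypt(encrypted, secret):
    """XOR is its own inverse, so decryption repeats the same operation."""
    return encrypt(encrypted, secret)


S = b"pre-agreed secret"        # both parties must already share this
M = b"I found an antelope"
E = encrypt(M, S)

print(decrypt(E, S) == M)               # True: the right secret recovers M
print(decrypt(E, b"wrong guess") == M)  # False: a wrong secret does not
```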

How do we know that something is "very hard"? Well, if we have a guess for S, we can check that the decryption gives us a plausible message. How hard is it to guess S? Returning to the quotation at the beginning of the section, "standards-based crypto" usually uses at least 128 bits for S (also known as the "key size"). That means that there are 2 to the power of 128 options. It is largely agreed that the smallest time-frame relevant for computation is the time it takes a photon (light particle) to cross a hydrogen atom. A hydrogen atom is about 10**-11 meters long. The speed of light is about 10**8 meters per second, which means the crossing takes more than 10**-20 seconds, which is more than 2**-70 seconds. Thus, if we have 2**128 options, we need more than 2**58 seconds. 2**30 seconds is more than a year, so it will be more than 2**28 years, which is about 268 million years. I used various approximations above, but the conclusion still stands - and more and more modern crypto uses 256-bit keys, which is not quite the heat death of the universe, but the point remains: you cannot guess and hope to win. However, it turns out to be an open problem whether any question where you can easily verify guesses (the formal name for that is NP) is "hard" (the formal name for "easy" problems is P). By open, I mean that many computer scientists have tried tackling this problem over the last 50 years, with no success (and little progress).
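The arithmetic is easy to replay. The sketch below uses the same rough figures as the text (atom size, speed of light, one guess per photon crossing); the variable names are made up, and the second estimate, using an ordinary 365.25-day year, is an addition for comparison. Note that 2**28 comes out at roughly 268 million.

```python
KEY_BITS = 128

# Approximations taken from the text; both are deliberately rough.
ATOM_SIZE_METERS = 1e-11
LIGHT_SPEED_METERS_PER_SECOND = 1e8

seconds_per_guess = ATOM_SIZE_METERS / LIGHT_SPEED_METERS_PER_SECOND  # about 1e-19
assert seconds_per_guess > 2.0 ** -70   # the lower bound used in the text

# Exact integer arithmetic for the lower bound: 2**128 guesses at 2**-70 seconds each.
lower_bound_seconds = 2 ** (KEY_BITS - 70)           # 2**58
lower_bound_years = lower_bound_seconds // 2 ** 30   # 2**30 seconds overstates a year

# The same bound with an ordinary 365.25-day year instead of 2**30 seconds.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
realistic_years = lower_bound_seconds / SECONDS_PER_YEAR

print(lower_bound_years)    # 268435456, i.e. 2**28
print(realistic_years)      # on the order of nine billion
```

Either way the number only goes up when the approximations are tightened, since every rounding step above was in the attacker's favour.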

This means, in particular, we cannot know that any cryptographic computation is really "good". However, what we can do, and the second part of why "really good" is followed by "standards-based", is to ask really smart people to try and solve a cryptographic problem. If they can't, after trying really hard, we assume that the problem cannot be solved, and we use it as our encryption mechanism.

In modern times, the names "Ug" and "Ung", perfectly good though they may be, have fallen into disuse. Modern cryptographers usually talk about "Alice" and "Bob" wanting to transmit messages, and "Eve" wanting to listen to them surreptitiously.

So here is an example of a cryptographic system: if you buy a DVD in the store, it is encrypted with a special key. When you buy a DVD player in the store, it contains the key. Your DVD player is a computer, with special software that decrypts the DVD, with a special key, and then plays it. So Alice, who is the Hollywood studio, encrypted the movie contents. Then Bob, the DVD player, decrypts it. Bob, to play the DVD, must have the key. You bought the DVD, so there is nothing stopping you from opening up the DVD player, and taking the key, is there?

Well, there are two things. One is simple: Hollywood, before giving the DVD player maker the key, makes sure that the DVD player is hard to "tamper" with. The maker must put the key in a special chip, glued to the microchip, that self-destructs if people tamper with it. As you can imagine, creative people have found ways to defeat that self-destruction. And so, Hollywood convinced the US government to pass a law called the "Digital Millennium Copyright Act". The DMCA says that

  1. It is illegal to tamper with the DVD that you bought and get the key out.
  2. It is illegal to tell someone the key that you dug out of the DVD.
  3. It is illegal to tell someone how to tamper with the DVD to get the key out.
  4. It is illegal to tell someone where to find instructions on how to tamper with the DVD, or how to get the key.

You might have noticed that 2-4 are restrictions on speech. There are numbers that are so illegal, that not only are you not allowed to write them on a piece of paper and give them to your friend, but if someone spray-painted them on a building, you are not allowed to tell anyone where that building is. Whether you're allowed to tell them where they can find a map with the building starred in it is, I believe, still up for debate.

The incredulity continues as you find songs on YouTube and flag images that encode illegal numbers. This means that certain songs and flags are now illegal. In fact, it very well might be that the Wikipedia page on "Illegal numbers" is already illegal, since it contains data that can allow recovery of these numbers. I wish I were kidding, but I am not.

I, personally, am not a copyright extremist. I am not committed to abolishing copyright. However, I think that understanding the math and physics of computation is important, because otherwise we end up making songs illegal. When Hollywood claims that copyright infringement might cost jobs, and that this necessitates stronger copyright law, we should first ask "Will this make singing songs illegal?"

This is why programmers are concerned about copyright law - because we understand all of the above. I hope, after reading this, you are also concerned!


20 Jan 2012 5:36pm GMT

Jack Moffitt: The More Things Change: A Review of The Soul of a New Machine

Already in my career I've experienced enormous passion, burnout, extraordinary dedication to my team and projects, and depression. I'm sure many others have as well. Has it always been this way with technology? I often wonder if this rollercoaster is necessary, healthy, or normal.

I recently saw a recommendation for The Soul of a New Machine, which tells the story of a team of engineers at Data General who built a new 32-bit computer in the late 1970s. The book is fascinating. Thirty years later, many of its descriptions of the project and the way the team worked and was treated could apply to any modern project.

The plot summary will no doubt sound familiar to you: A team of mostly young, mostly male engineers works grueling hours to build something amazing in too short an amount of time. They succeed, albeit a bit over their original schedule. Despite the project's commercial success, the team is denied both recognition and financial rewards and many end up leaving the company. Almost all of them ultimately enjoyed it and would (and did) do it again.

There were many pieces of this story that resonated with me.

Work is a Drug

On overworking, Tom West, the manager of the team in the book, says:

That's the bear trap, the greatest vice. Your job. You can justify just about any behavior with it. Maybe that's why you do it, so you don't have to deal with all those other problems.

Why deal with the unpredictable world, when the controllable world of creation is available? It's code as escapist drug, and I love to get high on it. Mundane things like cleaning my house, and more serious ones like taking care of my health, are all easy to avoid while fixing bugs or starting a new project.

It's both possible and important to find a balance.

The team's secretary, who was much more than her title suggests, suffered and succeeded with the rest of the team. Even she says:

I would do it again. I would be very grateful to do it again. I think I would take a cut in pay to do it again.

Even as I recover from projects that burned me out, I am constantly thinking about how to do new ones. In fact, while I'm doing any project, I'm already thinking about doing another. This sounds like drugs again. But they are good drugs.

Harassment and Treatment of Women

The book describes how some team members tormented the lone female engineer. This is something that still happens today, and it's terrible. And people then wonder why there are so few women in our industry.

In addition to that, at the end when they hand out the peer awards, their award to the woman was for putting up with them, not for any of her actual accomplishments.

Betty Shanahan was that lone woman, and it looks to me that she deserved more than just an award for thick skin. She's the CEO of the Society of Women Engineers, and she was "a member of the design team for the first parallel processing minicomputer and manager of hardware design for subsequent systems." She later moved to the business side of technology, and I wonder if that had anything to do with her having to put up with the Eagle team's harassment.

How Something is Done is Important Too

Often we judge things by their properties, but one can also rightly judge something by how it is made. Shoes made from child labor are less good than those made in other ways.

Kidder, the book's author, discusses this:

In The Nature of the Gothic John Ruskin decries the tendency of the industrial age to fragment work into tasks so trivial that they are fit to be performed only by the equivalent of slave labor. Writing in the nineteenth century, Ruskin was one of the first, with Marx, to have raised this now-familiar complaint. In the Gothic cathedrals of Europe, Ruskin believed, you can see the glorious fruits of free labor given freely. What is usually meant by the term craftsmanship is the production of things of high quality; Ruskin makes the crucial point that a thing may also be judged according to the conditions under which it was built.

By this kind of measure, is the work many teams do good? Is the Eagle computer that Tom West's team built really a success since the team worked much overtime, suffered divorces and other problems, and in the end received little to no reward?

I think it's time for entrepreneurs and workers in our industry to demand better. Our outputs will be better if they are made sustainably, and not just by the measure above. In retrospect, maybe the reviewers of LA Noire should have taken into account the trials of its developers; it certainly would not have fared well.

Freedom of Expression

I want to hire resourceful people. I want to describe a general outline of a design and not have to describe it in intricate detail in order for them to build it.

It turns out that this is critical for happiness. If we're told exactly how to do something, it takes much of the creativity and fun out of the work.

Engineers are supposed to stand among the privileged members of industrial enterprises, but several studies suggest that a fairly large percentage of engineers in America are not content with their jobs. Among the reasons cited are the nature of the jobs themselves and the restrictive ways in which they are managed. Among the terms used to describe their malaise are *declining technical challenge; misutilization; limited freedom of action; tight control of working conditions*.

You must trust those you work with to be resourceful. If you don't trust them, you will end up micromanaging them into unhappiness, and you will also remove their valuable creative input from your product.

There is a balance to be struck with feedback. The Eagle engineers thought that the managers didn't appreciate their efforts, but in reality, some of this was them trying to stay out of the way. Kidder asked Tom West's boss:

Had the Eagle project always interested him or had it grown in importance gradually?

"From the start it was a very important project."

Was he pleased with the work of the Eclipse group?

"Absolutely!" His voice falls. "They did a hell of a job."

But some members of the team felt that they had been rather neglected by the company.

"That doesn't surprise me," he says. "That's frequently the case. There's often a conflict in people's minds. How much direction do they want?"

I've had this same issue with investors as well. You don't want them to meddle with your company or your product, but you also want their advice and guidance. It's possible to go too far in either direction, but mostly you hear about stories where investors meddle too much. I personally think it's probably better to err on the side of too little help than to end up with too much meddling.

The Venture Capitalists

Even thirty years ago, the VCs had a bad rap. Tom West was asked in a Wired article years after the book's publication why he stayed at Data General until he retired:

"You could do new products and companies within the company, rather than shag some venture capitalist and kill yourself for five years." To be an entrepreneur, he says, "you have to be interested in networking, even with fools."

This is another reason why I would prefer to bootstrap companies if at all possible.

Tom West ended up working on many interesting projects at Data General, but ultimately, none of them got the support or recognition they deserved. The other members of the Eagle team spread out and started or worked for new companies, and in general seemed much happier.

Final Thoughts

In the end, it's both a fascinating tale of heroism and creativity and a saddening tale of undervalued and underpaid engineers. I am both emboldened to keep following my passions and more mindful of their dangers. My troubles are not unique - not even modern. Thirty years after this book was written, I feel like it could have been written yesterday.

20 Jan 2012 10:20am GMT

Glyph Lefkowitz: The Concurrency Spectrum: from Callbacks to Coroutines to Craziness


Concurrent programming idioms are on a spectrum of complexity.

Obviously, writing code that isn't concurrent in any way is the easiest. If you never introduce any concurrent tasks, you never have to debug any problems with things running in an unexpected order. But, in today's connected world, concurrency of some sort is usually a requirement. Each additional point where concurrency can happen introduces a bit of cognitive overhead, another place you need to think about what might happen, so as a codebase adds more of them it becomes more difficult to understand them all, and it becomes more challenging to understand subtle nuances of parallel execution.

So, at the simplest end of the spectrum, you have callback-based concurrency. Every time you have to proceed to the next step of a concurrent operation, you have to create a new function and new scope, and pass it to the operation so that the appropriate function will be called when the operation completes. This is very explicit and reasonably straightforward to debug and test, but it can be tedious and overly verbose, especially in Python where you have to think up a new function name and argument list for every step. The extra lines for the function definition and return statement can be an impediment to quickly understanding the code's intentions, so what facilitates understanding of the concurrency model can inhibit understanding of the code's actual logical purpose, depending on how much concurrent stuff it has to do. Twisted's Deferreds make this a bit easier than raw callback-passing without fundamentally changing the execution dynamic, so they're at this same level.
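A minimal sketch of the callback style, deliberately written with plain functions rather than Twisted's Deferred API so that it runs without an event loop. `fetch_number` and `double` are hypothetical operations that complete immediately; in real code they would finish some time later.

```python
def fetch_number(callback):
    """Hypothetical asynchronous operation; here it completes at once."""
    callback(21)


def double(number, callback):
    """Second hypothetical operation, also handing its result to a callback."""
    callback(number * 2)


def fetch_and_double(on_done):
    # Every step needs its own named function and argument list.
    def got_number(number):
        double(number, got_doubled)

    def got_doubled(doubled):
        on_done(doubled)

    fetch_number(got_number)


results = []
fetch_and_double(results.append)
print(results)   # [42]
```

Two steps of actual logic cost two extra function definitions, which is the verbosity described above.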

Then you have explicit concurrency, where every possible switch-point has to be labeled somehow. This is yield-based coroutines, or inlineCallbacks, in Twisted. This is more compact than using callbacks, but also more limiting. For example, you can only resume a generator once, whereas you can run a callback multiple times. However, for a logical flow of sequential concurrent steps, it reads very naturally, and is shorter, as it collapses out the 'def' and 'return' lines, and you have to think of at least two fewer names per step.
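The same two-step flow in the yield-based style might look like the sketch below. It does not use `inlineCallbacks`; the small `run` driver is a stand-in written for this example, and `fetch_number` and `double` are hypothetical operations that complete immediately. It relies on Python 3 generators being able to `return` a value.

```python
def fetch_number(callback):
    """Hypothetical asynchronous operation; here it completes at once."""
    callback(21)


def double(number):
    """Return an operation (a function taking a callback) that doubles `number`."""
    def operation(callback):
        callback(number * 2)
    return operation


def run(generator, on_done):
    """Tiny driver: resume the generator each time a yielded operation completes."""
    def step(value=None):
        try:
            operation = generator.send(value)
        except StopIteration as stop:
            on_done(stop.value)      # the generator's return value
            return
        operation(step)
    step()


def fetch_and_double():
    number = yield fetch_number      # switch-point, explicitly labeled
    doubled = yield double(number)   # another labeled switch-point
    return doubled


results = []
run(fetch_and_double(), results.append)
print(results)   # [42]
```

The logic reads top to bottom with no intermediate function names, and every place where something else could run is marked by a `yield`.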

However, that very ease can be misleading. You might gloss over a 'result = yield ...' more easily than a 'def whatever(result): return result; something(whatever)'. Nevertheless, if you have a 'yield' everywhere you might swap your stack, then when you have a concurrency bug, you can look at any given arbitrary chunk of code and know that you don't need any locks in it, as long as you can't see any yield statements. Where you do see yield statements, you know that you have some code that needs to be inspected.

To continue down that spectrum, a cooperatively multithreaded program with implicit context switches makes every line with any function call on it (or any line which might be a function call, like any operator which can be overridden by a special method) a possible, but not likely, culprit. Now when you have a concurrency bug you have to audit absolutely every line of code you've got, although you still have a few clues which will help you narrow it down and rule out certain areas of the code. For example, you can guess that it would be pathological for 'x = []; ...; x.append(y)' to context switch. (Although, given arbitrary introspection craziness, it is still possible, depending on what "..." is.) This is way more lines than you have to consider with yield, although with some discipline it can be kept manageable. However, experience has taught me that "with some discipline" is a code phrase for "almost never, on real-life programming projects".

All the way at the end of the spectrum of course you have preemptive multithreading, where every line of code is a mind-destroying death-trap hiding every possible concurrency peril you could imagine, and anything could happen at any time. When you encounter a concurrency bug you have to give up and just try to drink your sorrows away. Or just change random stuff in your 'settings.py' until it starts working, or something. I never really did get comfortable in that style. With some discipline, you can manage this problem by never manipulating shared state, and only transferring data via safe queueing mechanisms, but... there's that phrase again.
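For what the disciplined version looks like, here is a minimal sketch using only the standard library: worker threads that touch no shared mutable state and communicate solely through `queue.Queue`. The `worker` function and the squaring job are made up for the example.

```python
import queue
import threading


def worker(inbox, outbox):
    """Owns no shared state: reads from one queue, writes to another."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel: no more work
            break
        outbox.put(item * item)


inbox = queue.Queue()
outbox = queue.Queue()
threads = [threading.Thread(target=worker, args=(inbox, outbox)) for _ in range(4)]
for thread in threads:
    thread.start()

for number in range(10):
    inbox.put(number)
for _ in threads:
    inbox.put(None)               # one sentinel per worker
for thread in threads:
    thread.join()

# Results arrive in whatever order the threads finished, so sort them.
squares = sorted(outbox.get() for _ in range(10))
print(squares)
```

The queues do all the locking. The moment a worker reaches for a module-level list or dictionary instead, the guarantees are gone, which is where the discipline tends to fail.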

Some programming languages, like Erlang, support efficient preemptive processes with state isolation and built-in super-cheap super-fast queues to transfer immutable values. (Some other languages call these "threads" anyway, even though I would agree with Erlang's classification as "processes".) That's a different programming model entirely though, with its own advantages and challenges, which doesn't land neatly on this spectrum; if I'm talking about left and right here, Erlang and friends are somewhere above or below. I'm just describing Python and its ilk, where threads give you a big pile of shared, mutable state, and you are constantly tempted to splash said state all over your program.

Personally I like Twisted's style best; the thing that you yield is itself an object whose state can be inspected, and you can write callback-based or yield-based code as each specific context merits. My opinion on this has shifted over time, but currently I find that it's best to have a core which is written in the super-explicit callback-based approach with no coroutines at all, and then high-level application logic which wraps that core using yield-based coroutines (@inlineCallbacks, for Twisted fans).

I hope that in a future post, I may explain why, but that would take more words than I've got in me tonight.

20 Jan 2012 6:23am GMT

17 Jan 2012

Twisted Matrix Laboratories: December Sprint Report

Twisted sprint? Twisted sprint! Here's the final Twisted sprint report of 2011, from our December 10th event at Smarterer in Boston.


David Sturgis:


JP Calderone:

This was the last sprint for JP as a Bostonian. We will miss you!


Itamar:


Alex Levy:


Glyph:


I (Jessica McKellar):


Thank you David for organizing this, and Smarterer for hosting.

Thank you to everyone who closed out 2011 with contributions to Twisted!

17 Jan 2012 4:15am GMT

12 Jan 2012

Jack Moffitt: The Potentially Dark Future of Search

Twitter sees Google's latest Google+ feature, integration into Google search, as anti-competitive, and it probably is. However, it brings to the surface some real issues with the future of search and of data.

Twitter's argument:

We're concerned that as a result of Google's changes, finding this information will be much harder for everyone. We think that's bad for people, publishers, news organizations and Twitter users.

Google's response was:

We are a bit surprised by Twitter's comments about Search plus Your World, because they chose not to renew their agreement with us last summer (http://goo.gl/chKwi), and since then we have observed their rel=nofollow instructions.

People have been digging into the semantics of nofollow (see Danny Sullivan and Luigi Montanez), but there is a much bigger issue.

Google and other established and up-and-coming search engines have no real way to include lots of data in their index. It's easy to imagine that the lack of access to Twitter and Facebook data was a motivator for Google+ in the first place.

Lots of sites now generate enough data that it is unrealistic to crawl them. For example, YouTube has more new content every day than they allow anyone to crawl. Twitter is essentially the same. This means there is no way to index this data without special arrangements with the provider. Twitter has closely guarded their firehose of data, but at least they have some mechanism to obtain it. YouTube, as far as I am aware, has no such mechanism.

My team and I ran into this problem head on trying to build Collecta, a real-time search engine. Access to the data was a primary blocker for many features and product ideas, and over the too short life of that company, access became significantly more difficult, not easier.

Google can build an effective search, even a real-time one, for YouTube, but no one else can. Twitter can build search for their data, but few others can, and their data access policies can and do change on a whim.

If Google believes that microblogging data will improve their search product, then a reasonable strategy to obtain that data is to try and build their own microblogging service to generate it. I can't fault Google for trying. If I thought Collecta could have effectively competed against Twitter for their audience, I would certainly have attempted that as well.

Google, Twitter, Facebook and others are hoarding silos of otherwise public data. Not only is this artificially limiting the features of their products, but it squashes the potential for new and exciting search applications. The search services that have sprung up are limited to your own data, aggregate results from service-specific search APIs, exist at the mercy of data providers, or make do with a tiny subset of the data. I don't think Google could have built their own search engine if the Web were similarly hostile.

One could argue for requiring these bits of data to be openly available, but unlike the data of the past, this data is expensive to publish and consume. Most of these services may not even have a mechanism to publish the data, even internally. Simply receiving the YouTube or Twitter firehoses (and not counting video or image media) would require significant engineering effort, and the rate of data generation is only accelerating.

I think we must push for open access to data, even if it is costly. These data wars benefit very few. If things don't change, the future of search is dark.

12 Jan 2012 10:12am GMT

10 Jan 2012

Jp Calderone: Learn About Twisted at PyCon 2012

At PyCon this year I'll be presenting a tutorial to introduce Python programmers to Twisted. This tutorial has two goals. First, to give attendees a firm grasp of Twisted's concurrency model, both in the abstract and the concrete. Second, to remove the mystery around the tools Twisted provides for developing robust, testable concurrent applications. If you attend, you'll come away with an understanding of how event loops work and how to write code that works best in Twisted's event loop.

I am a long time core Twisted developer with real world experience building maintainable, scalable systems with Twisted. I've also presented similar introductory Twisted tutorials several times in the past, letting me learn the common sticking points and teaching approaches to help overcome them.

Check out the tutorial's page on the PyCon 2012 website for details about what will be covered. Come learn how to leverage Twisted and Twisted-based libraries to their fullest extent!

10 Jan 2012 6:21pm GMT

05 Jan 2012

Moshe Zadka: Retroactive New Year’s Resolutions

As is my tradition, I will be posting my New Year's resolutions from last year, retroactively decided:

Get into a serious relationship

Get engaged

Move to a team at work where I fit in better

Move to a nicer apartment

Get a loan (to build up my credit history)

Go to a Less Wrong meet-up

Get a talk accepted at PyCon

Get more recommendations on my LinkedIn profile

Write Rationalist Fan Fiction

Contact my Congressperson


05 Jan 2012 10:17am GMT

02 Jan 2012

Thomas Vander Stichele: How do you manage mailing lists?

Every new year is a time of cleaning. After getting back to Inbox 0, my next target is my mailing list subscriptions.

It must be something psychological, but I cannot bring myself to unsubscribe from some of these mailing lists. I don't check on them daily, but once in a while it's darn useful to search through my local copy of mails on, say, selinux, and find solutions for a problem I'm having.

However, all this mailing list mail gives me a lot of headaches. My email client is slow, and I want it to be fast for the real mail I'm getting (from actual people, needing actual work). It's hard to track the mails that matter: all my list mail gets put into folders automatically with some procmail magic, but that also means that some of the things I should be paying more attention to are just another bold folder in Evolution somewhere down the mail tree. And lastly, the server where I host my mail, shared with friends, gets too much traffic, and syncing three different Evolution instances over IMAP with it is a big part of the burden.

I vastly preferred the newsreader model of old, and I think the de facto standard of mailing lists really is a mistake. But I'm not sure what to replace it with.

What I want:

  1. have selected mailing list archives be available on my machines, locally
  2. have them synced/updated automatically
  3. have them out of the way of my normal mail usage unless I need them

I've been considering getting a separate email account just for email lists for this purpose, although I don't look forward much to having to change all my subscriptions, and would first like to hear from other people how this approach works out for them.

There used to be a push towards web-based mailing list subscriptions, but I don't know if anyone is really seriously using that, and I would like to have the option of reading these mailing list archives offline.

How do you separate your 'real' mail from your mailing list mail? How do you handle them?

02 Jan 2012 4:00pm GMT

30 Dec 2011

Thomas Vander Stichele: using xargs on a list of paths with spaces in a file

Every few weeks I have to spend an hour figuring out exactly the same non-googleable thing I've already needed to figure out. So this time it's going on my blog.

The problem is simple: given an input file listing paths, one per line, which probably contain spaces, how do I run a shell command that takes each line as a single shell argument?

Today, my particular case was a file /tmp/dirs on my NAS which lists all directories in one of my dirvish vaults that contain files bigger than a GB. For some reason not everything is properly hardlinked, but running hardlink on the vault blows up because there are so many files in there.

Let's see if WordPress manages to not mangle the following shell line.

perl -p -e 's@\n@\000@g' /tmp/dirs | xargs -0 /root/hardlink.py -f -p -t -c --dry-run
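The perl step above only turns every newline into a NUL byte so that xargs -0 splits on NUL and nothing else. As a rough sketch of the same idea without perl, tr can do the swap. The helper name each_line_as_arg and the sample paths below are made up for illustration, and the sketch assumes no path itself contains a newline.

```shell
# Hypothetical helper: run a command with every line of a file passed
# as exactly one argument, spaces and all.
# Usage: each_line_as_arg LIST_FILE COMMAND [ARGS...]
each_line_as_arg() {
    list_file=$1
    shift
    # tr swaps each newline for a NUL byte; xargs -0 then treats NUL
    # as the only separator, so spaces inside a line are preserved.
    tr '\n' '\0' < "$list_file" | xargs -0 "$@"
}

# Small demo with made-up paths containing spaces.
demo_list=$(mktemp)
printf '%s\n' '/tmp/dir one' '/tmp/dir two/sub dir' > "$demo_list"

# printf repeats its format for each argument, so this prints one
# bracketed path per line.
each_line_as_arg "$demo_list" printf '[%s]\n'

rm -f "$demo_list"
```

With the file from this post, the call would presumably look like each_line_as_arg /tmp/dirs /root/hardlink.py -f -p -t -c --dry-run. GNU xargs also accepts -d '\n' to split on newlines directly, but that option is not available in every xargs implementation.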

30 Dec 2011 6:18pm GMT