06 Feb 2012

feedPlanet Grep

FOSDEM organizers: Thank you, volunteers!

FOSDEM would like to thank all volunteers who helped make our conference possible again: the busload of students who helped with the setup, the numerous volunteers who responded to the call for volunteers, the people who spontaneously showed up at the infodesk offering their services. The regular veterans, the new blood. You all did a splendid job and I sincerely hope to see you all again next year.

Thanks, guys. Couldn't have done it without you!

06 Feb 2012 11:40am GMT

05 Feb 2012

Frederic Descamps: Fosdem 2012 - Pictures

I already uploaded the pictures of Fosdem and especially the MySQL & Friends devroom:




Thank you to all visitors and speakers!

I hope you enjoyed it, and see you next year!

05 Feb 2012 10:30pm GMT

FOSDEM organizers: Error on printed schedules

We have discovered that the printed schedules end about an hour before the conference is scheduled to end. We are still trying to decide whether the schedule is too long or the paper too short.

Note that at 17:00 in Janson, Bdale Garbee will present his Freedom, Out of the Box! keynote. This will be followed immediately by the closing talk and FOSDEM dance.

Be there ... or be elsewhere!

05 Feb 2012 12:27pm GMT

FOSDEM organizers: Video feedback?

FOSDEM is streaming video from a select number of rooms this year (see the URLs if you want to watch).

Watching the stream from home? Love it? Hate it? Feedback is much appreciated!

You can join us through IRC: #fosdem on Freenode, or (with slightly higher latency) use the hashtag #fosdemvideo on Twitter.

Thanks!

05 Feb 2012 10:24am GMT

Steven Wittens: A Useful BitTorrent Analogy

Image: the Xerox 914, the first successful commercial photocopier.

BitTorrent has been around for over a decade now. And yet, when mentioned in the media, it's pretty much universally associated with piracy and illegal file sharing.

Just the other day, I saw a journalist write proudly: "No, I don't have a Torrent program and I'm not downloading one." A journalist! Someone who is supposed to be an expert at retrieving information and sharing it!

BitTorrent is not scary; what's more, it generates a large share of the traffic on the internet. In the 21st century it should be a tool that sits on your digital utility belt, not something you wouldn't touch with a 10-foot pole. So here's a simple analogy to help understand it.

· • ·

Imagine a budget-starved teacher needs to hand out notes for class, but can only afford one copy. The document is 10 pages long, and there are 10 students who each need a full set.

The teacher could just give the notes to one student, and ask him to make all the copies, but that would only shift the burden, leaving him to pay for all 100 pages.

Instead, the teacher has an idea. She hands page 1 to student #1, page 2 to student #2, and so on, and tells each student to make 10 copies of their single page. The next week, the students can distribute them amongst themselves before class, and everyone has a full set. Nobody has to pay for more than their own 10 pages.

Everyone's happy: the teacher gets to share her knowledge cheaply, and the students don't mind paying for their own copies.

In the middle of the term, a new student joins. She could borrow someone else's big pile of notes, and copy the entire stack of paper, but that would mean she would have to pay for it all, and she's on a budget too.

So instead, she just goes around and asks each student to make a single copy of the pages they were assigned previously. The next week, she collects all the copies, and gets a full set without even bothering the teacher.

She gets a free pass to get up to speed, but the other students don't mind chipping in. That's because she immediately joins the game and can make copies too. The teacher can now hand out one page extra each week, or decide to give one student a free pass. If more students join, it works better and better.

Now instead, imagine that students join and leave the class every single day, and the teacher isn't quite so organized. She just puts her big stack of notes on the desk, and tells everyone they can take any page they want, as long as they promise to immediately make copies for anyone who asks. The students are all friendly, and make sure to keep each other in the loop about which pages everyone has. Both the originals and the copies are copied as many times as needed.

· • ·

That's BitTorrent in a nutshell. For any given class (i.e. a file that people are interested in), a cloud of students forms (i.e. the peers in the so-called peer-to-peer network). The peers compare notes, see which pieces they are missing, and swap copies with each other. Eventually, the teacher (a.k.a. the seeder) can leave, taking her original copy with her, and the system will keep working. As long as there is at least one copy of every page in the room, the students can make more, and the full set lives on.
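
The classroom can be played out in a few lines of Python. This is only a toy sketch of the analogy, not the real BitTorrent protocol: the names `swap_pieces`, `students` and `newcomer` are made up for illustration, and the rule that each peer picks up one missing page per round is an assumption chosen to keep the model simple.

```python
import random


def swap_pieces(peers, num_pieces, rng):
    """Run swap rounds until every peer holds every piece; return the round count."""
    full_set = set(range(num_pieces))
    rounds = 0
    while any(have != full_set for have in peers.values()):
        # Snapshot what everyone holds at the start of the round ("comparing notes").
        snapshot = {name: set(have) for name, have in peers.items()}
        progressed = False
        for name, have in peers.items():
            missing = sorted(full_set - have)
            rng.shuffle(missing)
            for piece in missing:
                # Any other peer holding the piece can provide a copy.
                if any(piece in other for peer, other in snapshot.items() if peer != name):
                    have.add(piece)  # assumption: one new page per peer per round
                    progressed = True
                    break
        if not progressed:
            raise RuntimeError("some piece is no longer available in the swarm")
        rounds += 1
    return rounds


rng = random.Random(0)
NUM_PIECES = 10
# The teacher (seeder) hands page i to student i, then leaves the room.
students = {f"student{i}": {i} for i in range(NUM_PIECES)}
students["newcomer"] = set()  # joins mid-term with nothing

rounds = swap_pieces(students, NUM_PIECES, rng)
print(f"everyone has all {NUM_PIECES} pages after {rounds} rounds")
```

Note that the teacher never appears in `students`: once every page exists somewhere in the room, the swarm completes on its own. Remove one student before the first round and the `RuntimeError` shows what happens when the last copy of a page disappears.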

This is pretty much the only way you can effectively distribute a massive archive of sensitive data to thousands or millions of people, without incurring massive bills. You can't use free or ad-supported services, as the material would get taken down instantly due to its sensitive nature. And you can't host it directly, as that would leave a trail pointing back to you.

With BitTorrent, your initial group of 'students' can be sworn to secrecy. After the initial round of copying, the teacher sneaks out, and the students just pin a notice on the bulletin board: "We have copies of The Forbidden Secrets by Dr. X. Come see us." Nobody claims to know who Dr. X is. Ideas and information flow freely, without censorship.

05 Feb 2012 8:00am GMT

04 Feb 2012

Wim Coekaerts: Changing database repositories in Oracle VM 3

At home I have a small Atom-based server that was running Oracle VM Manager 3, installed using the simple installation option. With simple installation you just enter a password and the Oracle VM Manager installer installs the Oracle XE database, WebLogic Server and the Oracle VM Manager container. The same password is used for the database user, the Oracle VM Manager database schema user, the weblogic user and the admin user for the manager instance.

The manager instance stores its data as objects inside the database. To do that, a datasource (DS) is defined in WebLogic during installation. It's basically a JDBC connection from WebLogic to the database. This DS requires the following information: database hostname, database instance name, database listener port number, schema username and schema password. In my default install this was localhost, XE, 1521, ovs, mypassword.

Now that I have re-organized my machines a bit, I have a larger server that runs a regular 11.2.0.3 database, which I also happen to use for EM12c. So I figured I would take some load off the little Atom server: keep it running Oracle VM Manager, but shut down XE and move the schema over to my dedicated database host. This is a straightforward process, so I just wanted to list the steps.

1) shut down Oracle VM Manager so that it does not continue updating the repository.
as root : /etc/init.d/ovmm stop

2) export the schema user using the exp command for Oracle XE
as oracle : 
cd /u01/app/oracle/product/11.2.0/xe
export ORACLE_HOME=`pwd`
export ORACLE_SID=XE
export PATH=$ORACLE_HOME/bin:$PATH
exp
(enter user ovs and its password)
export user (option 2)
export everything including data
this will create (by default) a file called expdat.dmp
copy this file over to the other server with the other database
The schema name is also in /u01/app/oracle/ovm-manager-3/.config (OVSSCHEMA)

3) shut down oracle-xe as it's no longer needed
as root : /etc/init.d/oracle-xe stop

4) import the ovs user into the new database. I like to do it as the user. 
I just simply pre-create the schema before starting import
as oracle : 
sqlplus '/ as sysdba'
create user ovs identified by MyPassword;
grant connect,resource to ovs;
at this point, run the imp utility on the box to import the expdat.dmp
import asks for username/password, enter ovs and its password
import yes on all data and tables and content.

At this point you have a good complete repository. 
Now let's make the Oracle VM Manager weblogic instance point to the new database.

5) on the original system, restart weblogic
as root : /etc/init.d/ovmm start
wait a few minutes for the instance to come online

6) use the ovm_admin tool
as oracle : 
cd /u01/app/oracle/ovm-manager-3/bin
./ovm_admin --modifyds orcl wopr8 1521 ovs mypassword
My new host name for the 11.2.0.3 database is called wopr, 
the database instance is orcl and listener is still 1521 with schema ovs
The admin tool asks for a password, this is the weblogic user password. 
In a simple install, this would be the same as your admin or ovs account password.

7) restart to have everything take effect.
as root : 
/etc/init.d/ovmm stop  ; sleep 5 ;/etc/init.d/ovmm start ;

8) edit the config file and update it with the new values
vi /u01/app/oracle/ovm-manager-3/.config 
modify :
DBHOST=
SID=
LSNR=
OVSSCHEMA=
and leave the rest as is. 

that should do it !
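
If you would rather script step 8 than edit the file by hand, something like the following Python sketch could do it. It assumes the .config file is made of plain KEY=value lines, as the entries above suggest. The function name `update_config_text`, the sample text and the `OTHER=keep-me` line are made up for illustration, and the new values are the ones from the ovm_admin command in step 6. Back up the real file before trying anything like this.

```python
def update_config_text(text, updates):
    """Return `text` with KEY=value lines rewritten for the keys in `updates`."""
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key, sep, _old = line.partition("=")
        if sep and key.strip() in remaining:
            out.append(f"{key.strip()}={remaining.pop(key.strip())}")
        else:
            out.append(line)  # leave every other line untouched
    # Append keys that were not present in the file at all.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Hypothetical file content; the real file holds more entries than these.
sample = "DBHOST=localhost\nSID=XE\nLSNR=1521\nOVSSCHEMA=ovs\nOTHER=keep-me\n"
new_values = {"DBHOST": "wopr8", "SID": "orcl", "LSNR": "1521", "OVSSCHEMA": "ovs"}
updated = update_config_text(sample, new_values)
print(updated)
```

To apply it for real, read /u01/app/oracle/ovm-manager-3/.config into a string, pass it through the function and write the result back.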

04 Feb 2012 7:36pm GMT

Peter Van Eynde: IPv6 versus IPv4 at fosdem :S



how so?

pevaneyn-mac:wireshark pevaneyn$ traceroute v4.fr.ipv6-test.com
traceroute to v4.fr.ipv6-test.com (46.105.61.149), 64 hops max, 52 byte packets
 1  193.191.79.254 (193.191.79.254)  6.215 ms  0.282 ms  0.244 ms
 2  ge.ar1.brucam.belnet.net (193.191.4.49)  0.350 ms  0.325 ms  0.365 ms
 3  10ge.cr2.bruvil.belnet.net (193.191.16.189)  1.143 ms  0.964 ms  0.994 ms
 4  ovh.bnix.net (194.53.172.70)  2.396 ms  1.900 ms  1.942 ms
 5  rbx-g2-a9.fr.eu (94.23.122.137)  5.712 ms  4.725 ms  4.794 ms
 6  rbx-2-6k.fr.eu (91.121.131.9)  10.489 ms  15.149 ms
    rbx-1-6k.fr.eu (91.121.131.13)  50.591 ms
 7  rbx-26-m1.fr.eu (213.251.191.201)  4.448 ms
    rbx-26-m1.routers.ovh.net (213.251.191.73)  4.754 ms  4.996 ms
 8  eight.t0x.net (46.105.61.149)  3.950 ms  3.975 ms  4.067 ms
pevaneyn-mac:wireshark pevaneyn$ traceroute6 v6.fr.ipv6-test.com
traceroute6 to v6.fr.ipv6-test.com (2001:41d0:1:d87c::7e57:1) from 2001:6a8:1100:beef:114f:fb76:XXXX:XXXX, 64 hops max, 12 byte packets
 1  2001:6a8:1100:beef::1  0.558 ms  0.674 ms  0.507 ms
 2  2001:6a8:1000:800f::1  0.370 ms  0.414 ms  0.393 ms
 3  10ge.cr2.bruvil.belnet.net  1.106 ms  1.112 ms  1.034 ms
 4  ae0-200.bru20.ip6.tinet.net  1.620 ms  1.572 ms  1.523 ms
 5  xe-2-1-0.ams20.ip6.tinet.net  6.063 ms
    xe-5-2-0.ams20.ip6.tinet.net  5.999 ms
    xe-8-1-0.ams20.ip6.tinet.net  6.002 ms
 6  * * *
 7  * * *
 8  * * *
 9  fra-5-6k.de.eu  25.602 ms *  30.531 ms
10  rbx-g2-a9.fr.eu  31.890 ms  27.448 ms  26.656 ms
11  rbx-1-6k.fr.eu  29.996 ms
    rbx-2-6k.fr.eu  33.715 ms
    rbx-1-6k.fr.eu  26.735 ms
12  2001:41d0:1:d87c::7e57:1  25.498 ms  31.873 ms  30.815 ms



So a trip around Europe. But IPv6 need not be slow:

pevaneyn-mac:fosdem pevaneyn$ traceroute6 www.debian.org
traceroute6: Warning: www.debian.org has multiple addresses; using 2001:858:2:2:214:22ff:fe0d:7717
traceroute6 to www.debian.org (2001:858:2:2:214:22ff:fe0d:7717) from 2001:6a8:1100:beef:114f:fb76:XXXX:XXXX, 64 hops max, 12 byte packets
 1  2001:6a8:1100:beef::1  0.640 ms  1.731 ms  0.607 ms
 2  2001:6a8:1000:800f::1  0.491 ms  0.356 ms  0.387 ms
 3  2001:6a8:1000:2::2  0.442 ms
    10ge.cr2.bruvil.belnet.net  1.081 ms  0.989 ms
 4  10ge.cr1.brueve.belnet.net  1.979 ms
    10ge.cr1.brueve.belnet.net  1.718 ms  1.479 ms
 5  20gigabitethernet1-3.core1.ams1.ipv6.he.net  4.766 ms  8.460 ms  7.190 ms
 6  10gigabitethernet1-1.core1.fra1.he.net  16.977 ms  20.783 ms  11.835 ms
 7  ge2-19-decix-ipv6-c1.ix.sil.at  70.823 ms  42.928 ms  45.012 ms
 8  2001:858:66:203:215:2cff:fe8d:bc00  27.416 ms  26.934 ms  28.561 ms
 9  ip6-te1-4-c2.oe3.sil.at  26.776 ms  26.413 ms  26.856 ms
10  2001:858:66:22c:217:fff:fed4:6000  27.156 ms  27.472 ms  26.778 ms
11  englund.debian.org  27.211 ms  27.641 ms  27.823 ms
pevaneyn-mac:fosdem pevaneyn$ traceroute www.debian.org
traceroute: Warning: www.debian.org has multiple addresses; using 86.59.118.148
traceroute to www.debian.org (86.59.118.148), 64 hops max, 52 byte packets
 1  193.191.79.254 (193.191.79.254)  0.619 ms  0.254 ms  0.255 ms
 2  ge.ar1.brucam.belnet.net (193.191.4.49)  0.432 ms  0.385 ms  0.448 ms
 3  10ge.cr1.brueve.belnet.net (193.191.16.205)  1.153 ms  1.557 ms  0.951 ms
 4  nl-asd-dc2-ias-csg01.nl.kpn.net (195.69.144.144)  5.608 ms  5.442 ms  10.251 ms
 5  * * *
 6  ffm-s1-rou-1021.de.eurorings.net (134.222.229.10)  38.019 ms  37.926 ms
    ffm-s1-rou-1021.de.eurorings.net (134.222.231.250)  39.953 ms
 7  ffm-s1-rou-1022.de.eurorings.net (134.222.228.86)  40.075 ms
    ffm-s1-rou-1022.de.eurorings.net (134.222.228.90)  38.180 ms
    ffm-s1-rou-1022.de.eurorings.net (134.222.228.86)  42.755 ms
 8  mchn-s1-rou-1022.de.eurorings.net (134.222.228.194)  33.019 ms  33.211 ms  37.045 ms
 9  wien-s2-rou-1002.at.eurorings.net (134.222.228.46)  39.827 ms  37.795 ms  39.839 ms
10  wien-s2-rou-1041.at.eurorings.net (134.222.123.242)  37.581 ms  37.633 ms  39.505 ms
11  sil.cust.at.eurorings.net (134.222.123.150)  37.654 ms  35.650 ms  35.521 ms
12  englund.debian.org (86.59.118.148)  38.009 ms  38.124 ms  40.628 ms
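
To put numbers on the comparison, here is a small Python sketch that averages the three final-hop round-trip times from each of the four traces above. The figures are copied from the output; the dictionary layout and names are just for illustration.

```python
from statistics import mean

# Final-hop round-trip times in ms, copied from the traceroute output above.
final_hop_ms = {
    ("ipv6-test.com", "IPv4"): [3.950, 3.975, 4.067],
    ("ipv6-test.com", "IPv6"): [25.498, 31.873, 30.815],
    ("www.debian.org", "IPv4"): [38.009, 38.124, 40.628],
    ("www.debian.org", "IPv6"): [27.211, 27.641, 27.823],
}

averages = {key: mean(samples) for key, samples in final_hop_ms.items()}
for (target, proto), avg in averages.items():
    print(f"{target:15s} {proto}: {avg:6.2f} ms")
```

For ipv6-test.com the IPv6 path is roughly seven times slower than IPv4 (about 29 ms against 4 ms), while for www.debian.org IPv6 is the faster of the two (about 28 ms against 39 ms).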



This entry was originally posted at http://pvaneynd.dreamwidth.org/148844.html. Please comment there using OpenID.

04 Feb 2012 4:13pm GMT

FOSDEM organizers: FOSDEM dance

Unfortunately, due to time constraints we were unable to entertain the crowd with our usual FOSDEM dance.
To make up for this, we have rescheduled it to after the closing talk.

04 Feb 2012 11:14am GMT

Frank Goossens: Fiesta: WP YouTube Lyte reaches 1.0.0

I just released the one dot ohhhh dot ohhhhhhhhhh version of WP YouTube Lyte!

From the changelog:

And an appropriate vid to go with this new release:

Video: http://youtu.be/rHdrwIffcWw (embedded with WP YouTube Lyte)

04 Feb 2012 7:23am GMT

03 Feb 2012

FOSDEM organizers: Schedule changes

The following are last-minute changes and are not in the booklet or printed schedule:

Saturday

Opening talk:

Open Mobile Linux devroom:

Virtualization and Cloud devroom:

Sunday

Telephony and Communications devroom:

Free Java devroom:

Perl devroom:

Graph Processing devroom:

03 Feb 2012 8:55pm GMT

Claudio Ramirez: Perl devroom @ FOSDEM2012

Just a short reminder of the Perl talks at FOSDEM2012.

The Perl dev-room will be held this Sunday, February 5th, from 9:00 to 17:00 in room AW1.121. We have a wide range of talks. Some talks target Perl programmers, with subjects ranging from beginner to advanced level. Other talks don't focus on the language itself, but rather on projects that use Perl as a building block.

So please, drop by if you are at FOSDEM…

Room: AW1.121
Sunday 2012-02-05
Event | Speaker | Room | When
Welcome to the Perl devroom | Claudio Ramirez | AW1.121 | 09:00-09:05
Moose Primer | Nicholas Perez | AW1.121 | 09:05-09:25
Advanced Moose Techniques | Nicholas Perez | AW1.121 | 09:35-09:55
Perlude: a taste of Haskell in Perl | Marc Chantreux | AW1.121 | 10:05-10:45
Perlito | Flávio Glock | AW1.121 | 11:05-11:45
The LemonLDAP::NG Project | Clément Oudot | AW1.121 | 11:55-12:15
LedgerSMB: Open source accounting running on Perl | Erik Huelsmann | AW1.121 | 12:25-12:45
Modern PerlCommerce | Stefan Hornburg | AW1.121 | 13:25-14:05
Rapid real-world testing using git-deploy | Ævar Arnfjörð Bjarmason | AW1.121 | 14:15-14:35
POSIX::1003 | Mark Overmeer | AW1.121 | 15:00-15:40
The FusionInventory Project | Guillaume Rousse | AW1.121 | 15:50-16:10
Using Moose objects with Memcached | Marius Olsthoorn | AW1.121 | 16:20-16:40


03 Feb 2012 8:29pm GMT

Xavier Mertens: Get The Most of Your Monitoring/Security Tools!

The idea for this article popped into my mind after a colleague of mine asked me to investigate a security incident. Nothing brand new: a customer's server, not properly patched and secured, was pwned. I found that the server was hit by the JBoss worm, which started to spread in October 2010. Then the server started to scan for other victims, etc. Why the server was not patched and why it was able to access the Internet directly, I don't know. I won't start a new debate here. I would just like to insist on the ways (read: tools) that can be used to detect such an incident at the right time.

When I started my investigations, I had a limited number of data sources: the firewall logs and a network monitoring appliance. There was no log management solution, and the server was turned off "to avoid more problems" (OMG!). The firewall logs of course gave me some relevant information, but what about the network monitoring appliance? This is the same kind of appliance that I use during the BruCON conference to keep an eye on the visitors' traffic. Very nice statistics can be generated. Basically, this appliance performs three tasks:

My investigations continued on this appliance and, as you can imagine, I found plenty of evidence:

By looking at the information reported by the appliance, the customer could have been alerted to the attack at an early stage (even in real time!). But those features were simply… not used! The appliance was installed to monitor network performance, that's it! But it could do much more!

That's an effect of the "Microsoft Syndrome"! What is this? I found a good definition on computerworld.com:

"There are several symptoms. One is when a tech company becomes so successful in a market and grows so quickly that it overlooks potential new markets. Another is when a tech company gets so large that it becomes increasingly difficult for it to innovate."

From my point of view, I would like to extend this definition to the technical side of IT products:

"Another symptom is when a piece of software becomes so complex that you only use a small percentage of its features and forget, or don't know, how to use the others."

A typical example is Microsoft Word. I'm a Word user but, honestly, I probably use 10% of all the features! Sometimes I work on RFPs that go very deep into feature requirements and, in the end, most of those features will remain unused or unimplemented.

I think it's time to recall the principle of "more with less". Implementing security solutions is very expensive and budgets are often frozen or reduced. If you put some (a lot of) bucks into a solution, be sure to use it at 100%! Read the manuals (you know, "RTFM!"), attend training, invest some time! Sometimes, cool features can be used for other purposes and increase the ROI! This reflection goes in the same direction as one of my previous articles about implementing security controls using Nagios.

03 Feb 2012 4:33pm GMT

Frank Marien: Extremon Unveiled

Ah, "Monitoring"

It certainly means different things to different people:

As sysadmins, we want to know our systems are OK, to the slightest detail, and if not, what is wrong. Preferably before, or at least while, it happens.

As (enlightened) developers we want to be able to follow our application's behaviour in production.

As service managers we want to know if we're delivering the service as agreed: what's up, what's down, for how long, how slow, who's to blame.

As managers we might want to know the bottom line, such as how many downloads or sales, in a pretty widget.

You can probably think of a few more. And you're probably not happy with what you have (or you wouldn't have monitoringsux), which is most likely different systems running beside each other, with different paradigms, platform-dependent APIs (if any), and different web GUIs with linear lists and state colours that you force to refresh every few minutes, and even then you look at old information. They also hold credentials to authenticate to the systems they monitor, making them dangerous points of failure in terms of security.

Depending on what you're trying to monitor, you may be OK with all of these.

But if you're like us, you'll end up in a multi-everything (platform, application, networks, silos, sites, policies) environment with no end of interdependencies, where at least some applications are interactive and time-critical, and the sysadmins and developers are colleagues, horizontal team members, or all the same people. This is the type of environment that we're growing Extremon for.

Taking 3 important headlines from the Extreme Monitoring Manifesto

Live, with Subsecond temporal resolution

Most of the data you're gathering will be required for different purposes. The service response time and validity you're testing in a functional test tells the sysadmins in the data center that it's fast enough, tells the developer that his caching strategy works in production, tells the service manager that you're OK with the SLA, tells the 1st and 2nd line support *at one glance* that the problem isn't the server, etc. It makes sense to gather the data only once, which gives you breathing room to gather it more intensely. I propose starting at one probe per second, which is peanuts for most modern systems, but which will give you data points at the highest resolution you'll ever want (you can always average over longer periods for different uses). I find that services that have issues with one-second probes are in deep trouble anyway, and should be rethought. Of course you shouldn't come up with the heaviest possible data or query set on purpose. But for normal use, 1/sec is really nothing.
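
Averaging one-second data over longer periods takes very little code. The Python sketch below is an illustration only, not part of Extremon: the `downsample` function and the synthetic response times are made up, and samples are assumed to arrive as (timestamp in seconds, value) pairs.

```python
from statistics import mean


def downsample(samples, period_s):
    """Average (timestamp_s, value) samples into buckets of `period_s` seconds."""
    buckets = {}
    for ts, value in samples:
        bucket_start = int(ts // period_s) * period_s
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical response times (ms), probed once per second for two minutes.
one_hz = [(t, 50.0 + (t % 10)) for t in range(120)]
per_minute = downsample(one_hz, 60)
print(per_minute)
```

The same one-second stream can feed a live display at full resolution and a per-minute or per-hour report through a reduction like this, which is the point of gathering the data only once.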

Agent Push really is the only option for system metrics, at that speed, and that's fine: Provisioning agents is a near zero cost game given DevOps practice, agent push solves the monitoring security issue in one fell swoop, requiring no connections from the monitoring hosts to the agent, hence, no authentication, no technical users, no flaws to exploit, and no endless login/measure/logout sequences wasting CPU slices and network traffic.

I currently favour collectd because it's fast, light, pluggable, and has a very efficient network protocol. We provision our collectd's using puppet, and have them push their metrics to multiple monitoring hosts on the Internet every second. Yes, you read that right, collectd uses UDP, so we're pushing UDP over the Internet. I hear you cry in horror that "Packets may get lost". Yes they may, and yes they do. But that's OK, the data will come in a few seconds later. It's no big deal. We've chosen to use the signing and encryption options, because we're paranoid and proud of it. We have our collectd's gather all the usual system data, but also application-specific metrics, from applications that support this, and e.g. JVM memory metrics.

The monitoring hosts have collectd instances in listening mode, so they get (most of) the collected data, which gives us the view from *inside* the hosts. Also, the monitoring hosts run any and all kinds of custom service tests, exercising the Internet-published services from the outside. This is the external view: What will the end-user experience. These tests push their results into the same collectd instances, meaning these now have all the relevant metrics.

Hot-pluggable components

As much as we love collectd's efficient binary UDP protocol, we want the simplest possible protocol, and that isn't it.

Using a small collectd write plugin, we write whatever collectd gathers to a multicast group, UDP again, in the simplest format we could find: label-value pairs. The protocol is this:

Metrics are grouped in "shuttles". Each shuttle consists of a number of lines, followed by a blank line.
Each line consists of a label, an equals sign, a value, and a carriage return.

A label consists of the reverse FQDN of your internet domain, followed by whatever hierarchical representation you see fit. Here are some lines from a shuttle:

be.apsu.prod.eridu.df.var.df_complex.reserved.percentage=5.16165872485
be.apsu.prod.eridu.df.opt.df_complex.reserved.percentage.state=0
be.apsu.prod.eridu.df.opt.df_complex.free.percentage.state.comment=More Than 60% Free Space
be.apsu.prod.eridu.df.home.df_complex.reserved.percentage.state=0
be.apsu.prod.eridu.df.tmp.df_complex.reserved.percentage=5.1617400345
be.apsu.prod.eridu.apsu_be.https.httpprobe.responsetime=127.850000
be.apsu.prod.eridu.apsu_be.http.httpprobe.responsetime=49.020000

The plugin adds a timestamp in ms, so every shuttle has one (not shown above)
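
Reading and writing this format takes only a few lines in most languages. Here is a minimal Python sketch, assuming lines end in a newline and/or carriage return and that a blank line closes a shuttle. The function names `format_shuttle` and `parse_shuttles` are made up for illustration, and the timestamp line is left out because its exact form is not shown above.

```python
def format_shuttle(metrics):
    """Serialise (label, value) pairs as one shuttle: label=value lines plus a blank line."""
    return "".join(f"{label}={value}\n" for label, value in metrics) + "\n"


def parse_shuttles(text):
    """Yield each shuttle found in `text` as a dict of label -> value (both strings)."""
    shuttle = {}
    for raw in text.splitlines():  # splitlines() copes with \n, \r and \r\n
        line = raw.strip()
        if not line:
            if shuttle:  # a blank line closes the current shuttle
                yield shuttle
                shuttle = {}
            continue
        label, sep, value = line.partition("=")
        if sep:  # ignore anything that is not label=value
            shuttle[label] = value
    if shuttle:
        yield shuttle


wire = format_shuttle([
    ("be.apsu.prod.eridu.apsu_be.https.httpprobe.responsetime", "127.850000"),
    ("be.apsu.prod.eridu.apsu_be.http.httpprobe.responsetime", "49.020000"),
])
shuttles = list(parse_shuttles(wire + wire))
print(len(shuttles), shuttles[0])
```

Values are kept as strings on purpose: some labels carry numbers and others carry text such as the ".comment" entries, so the conversion is left to whoever consumes the metric.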

Since these are multicast (with a TTL of zero), any process on the same monitoring host can join that multicast group and read all the metrics from all the collectd and custom agents. Here's where filters can clean up the namespace where necessary, contributors can translate values into states and trends, trends into states, states into alerts, and aggregators can contribute calculated values. Contributions just go back into the cauldron. For example, the "percentage" metrics in the example above are contributed by a "df" aggregator which takes reserved, free, and in-use metrics and calculates their equivalent percentages. "percentage.state" and "percentage.state.comment" are contributed by a "df.state" contributor that decides which percentage values are OK, for which disks.
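
To give an idea of how small a contributor can be, here is a hedged Python sketch in the spirit of the "df.state" contributor. It is not the actual Extremon code. The thresholds and the state codes 1 and 2 are assumptions; only state 0 and the "More Than 60% Free Space" comment appear in the sample shuttle above.

```python
def df_state_contributor(shuttle, ok_above=60.0, warn_above=20.0):
    """Derive .state and .state.comment entries from *.free.percentage metrics."""
    contributed = {}
    for label, value in shuttle.items():
        if not label.endswith(".free.percentage"):
            continue
        free = float(value)
        if free > ok_above:
            state, comment = 0, f"More Than {ok_above:g}% Free Space"
        elif free > warn_above:
            state, comment = 1, f"Less Than {ok_above:g}% Free Space"  # assumed warning state
        else:
            state, comment = 2, f"Less Than {warn_above:g}% Free Space"  # assumed alert state
        contributed[f"{label}.state"] = str(state)
        contributed[f"{label}.state.comment"] = comment
    return contributed


incoming = {"be.apsu.prod.eridu.df.opt.df_complex.free.percentage": "72.5"}
contribution = df_state_contributor(incoming)
print(contribution)
```

A real contributor would read shuttles from the multicast group and write its contribution back to the same group. The sketch only shows the transformation in the middle.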

We call this multicast group "The cauldron", since this is where all the ingredients are added and transformed. The nice thing about the multicast group, is that it's easy to plug into, live, easy to read and write from, by any process, in any language, without interrupting anything else, and we get an extraordinarily robust and proven implementation of it with any GNU/Linux we install.
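
Plugging into such a multicast group from Python needs only the standard socket module. The sketch below is an illustration under assumptions: the group address and port are invented (Extremon's real values are not given here), and the function names are hypothetical. The socket options themselves are the standard ones for IPv4 multicast.

```python
import socket
import struct

GROUP = "239.255.42.42"  # hypothetical multicast group address
PORT = 12345             # hypothetical port


def membership_request(group):
    """Build the ip_mreq structure: group address plus 'any local interface'."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))


def open_cauldron_reader(group=GROUP, port=PORT):
    """Join the multicast group and return a socket that receives its datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # allow several readers per host
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))
    return sock


def open_cauldron_writer(ttl=0):
    """Return a UDP socket whose multicast datagrams stay on this host when ttl is 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
    return sock


def dump_cauldron():
    """Print everything that passes through the group, forever."""
    reader = open_cauldron_reader()
    while True:
        datagram, _sender = reader.recvfrom(65535)
        print(datagram.decode("utf-8", errors="replace"), end="")
```

Calling `dump_cauldron()` on a monitoring host would show the cauldron "boiling". A contributor would combine a reader with a writer created by `open_cauldron_writer()` and send its results with `sendto(data, (GROUP, PORT))`.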

In the cauldron, any metric (and all its derived values, such as states, aggregates, etc.) appears each time the metric is received, and all metrics appear, for the entire namespace, so the cauldron may "boil" intensely if you add many metrics. For example, the cauldron on each of the 2 monitoring hosts we're working on today "boils" at about 5000 metrics per second. It only looks intense when you look at it. To the machine, that's only 64 KByte/sec, even without compression.

To add more hosts, for scaling, we would simply connect them using Ethernet, and set the ttl to 1 instead of 0, to allow the multicast out of the host. But we're far far away from needing that kind of scaling, at this point.

Simple Text-based Internet-Friendly Subscription Push API

One type of process in the cauldron allows multiple TCP connections, reads a simple HTTP URL consisting of the /-separated namespace, and serves shuttles conforming to that URL on a TCP connection, starting off with a complete set of all the current values, followed by updates. This allows any application to subscribe to the metrics it needs, and update a local cache. (Or not. If you were writing that widget, you might not even keep any cache, just update the widget as the data evolved.) We serve this with an Apache web server in front, to handle security and encryption.

Let's see how idle our CPUs are for these 2 systems (app1 and app2):

$ wget https://<hidden>/*/cpu/*/cpu/idle/value --user.. 

hidden.app2.cpu.0.cpu.idle.value=89.2023
hidden.app1.cpu.0.cpu.idle.value=88.32
hidden.app2.cpu.1.cpu.idle.value=99.1911
hidden.app2.cpu.0.cpu.idle.value=91.8071
hidden.app2.cpu.1.cpu.idle.value=99.8242
hidden.app1.cpu.0.cpu.idle.value=93.0782
hidden.app1.cpu.1.cpu.idle.value=93.5785
hidden.app2.cpu.1.cpu.idle.value=97.9927
hidden.app1.cpu.0.cpu.idle.value=86.2266
hidden.app1.cpu.1.cpu.idle.value=96.7542

The first 4 lines are the values at connection time; the rest of the lines are updates. Since we're measuring at 1 Hz, and CPU values tend to change all the time, we get updates every second.

Let's look inside an application (we've set up collectd to take these SNMP measurements on the server in question):

$ wget https://<hidden>/app1/snmp/counter  --user=..
hidden.app1.snmp.counter.validations.value=1.99592
hidden.app1.snmp.counter.cache_misses.value=0
hidden.app1.snmp.counter.cache_hits.value=2.49491
hidden.app1.snmp.counter.cache_refreshes.value=0
hidden.app1.snmp.counter.validations.value=3.48378
hidden.app1.snmp.counter.cache_hits.value=5.47449
hidden.app1.snmp.counter.validations.value=3.51587
hidden.app1.snmp.counter.cache_hits.value=4.52042

All production disk usage, live:

$ wget https://<hidden>/prod/**/df/**/free/percentage --user=..

etc.. etc..

A Python client for this is about 23 lines and uses only standard classes.
Of course we have and we'll maintain a few reference clients in various languages.
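
For illustration, a subscription client could look roughly like the Python sketch below. This is not one of the reference clients. It assumes the server uses HTTP basic authentication (suggested by the --user option in the wget examples), that lines arrive as label=value text, and that a blank line closes each shuttle. The names `apply_lines` and `subscribe` are made up.

```python
import base64
import urllib.request


def apply_lines(cache, lines):
    """Update `cache` in place from label=value lines (str or bytes); return it."""
    for raw in lines:
        text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
        label, sep, value = text.strip().partition("=")
        if sep:  # blank lines and anything without '=' are skipped
            cache[label] = value
    return cache


def subscribe(url, user, password, on_update):
    """Open a long-lived HTTP GET and call on_update(cache) after each shuttle."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    cache = {}
    with urllib.request.urlopen(request) as response:
        for raw in response:  # the response object iterates line by line
            if raw.strip():
                apply_lines(cache, [raw])
            else:
                on_update(cache)  # blank line: one shuttle is complete


# Offline demonstration using two of the CPU lines shown earlier.
cache = apply_lines({}, [
    b"hidden.app1.cpu.0.cpu.idle.value=88.32\n",
    b"\n",
    b"hidden.app1.cpu.0.cpu.idle.value=93.0782\n",
])
print(cache)
```

The cache always holds the latest value per label, so a widget can redraw from it after every shuttle, or skip the cache entirely and react to each line as it arrives.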

Display on a meaningful representation, and in real-time.

Web pages were intended to convey static documents with links between them. Stretching that metaphor only goes so far, and I don't think web pages are an appropriate medium to convey real-time data. (That's my opinion (fr4nkm); koendc has different ideas, and is working on JavaScript-based clients, which, I must say, look pretty impressive.) Also, we want a "meaningful representation", which implies that for anything more complex than a single server we want to get away from HTML-driven lists and status colours, and move to a full schematic of the systems we're monitoring, and their connections.

Drawing a full top-level schematic of one's systems is something I have found both extremely useful and relatively rare. Many systems have grown organically with their organisations, and nobody has ever thought of drawing such a schematic. This makes it very hard for anyone to get a good idea of how the whole functions, and encourages silo-type thinking with everyone just looking at their little part of the world. While there's nothing wrong with drawing a partial diagram, I find the minimum should be the "big picture".

Once you have the "big picture", why not use it to project monitoring data? In that way, you can immediately tell where the data that you're looking at fits into the whole, and, from the states of the connected systems, make deductions about what is going on and what the impact is. This allows for faster triage, and for experts in different domains to gather around the same view of the whole system, and look at their own details, while not losing sight of the connections.

For example: a web service goes "unusable" in the remote functional test and becomes red and glowy on the display. A sysadmin looks at it (and right-clicks it to indicate she's looking into it, see below), zooms in to find the exact measurements, and finds that the connection times out. The same functional test from the inside, which is right next to it, remains OK, responding within a few ms. Triage indicates that this is a network issue and that all the applications and backends are fine (and they do show as green), so she gets the network expert to look, and he zooms in on other parts of the system that are monitored. If the same service had been merely slow, she would have zoomed in on the warnings in the application server, where she might have found many cache misses, for example, due to some backend problem. If not, she might have asked the developer or application specialist to zoom in on the application metrics. All on the same screen, if required, from anywhere in the world with a reasonable TCP connection, if need be.

The display we're developing uses SVG to display schematics, with the home view being the largest supersystem that we monitor. Here we see systems and their boundaries, and some services and response times of the most important externally offered services, if any. The response times are live: if we see a bar graph shoot outwards and grow red, we know that service is at least slow. If a host goes yellow, it may have a disk space or CPU usage issue; if an application goes red, it may have fatal errors in its log file, JMX, SNMP or other metrics. The point is that SVG is a vector graphics format, so we can have any amount of detail hidden in our larger schematic, and zoom into any part to find those details. Host disk space need not be represented large enough to be readable from the home view: as long as the host state shows a problem, we can zoom in on it to make more detail visible, while mentally retaining the link between that host, its applications, and the whole system. This is a far cry from seeing a red icon next to APPSRVWEB001_VAR_TMP_FREE, and having to mentally make that link.

Also, we can easily give anyone who might want it a read-only view of our systems. It makes a great deal of difference to the customer service experience if a service desk agent with the same overview capability can tell a calling customer, with confidence, where the problem is (or is not) located, that someone is working on it (and perhaps who), and whom to contact. Without that capability, the agent has to "get back" to the customer, leaving the impression that we're not monitoring at all.

home view overview, with some functional tests

zoomed in on a host, CPU and disks

Detail of 3 functional tests

Implicit Provisioning (Test-driven infrastructure)

When we were provisioning machines manually, it followed that we provisioned monitoring manually as well: if you don't have the info in a machine-readable format, you cannot parse it. I know organisations that have 2-3 FTE working solely on provisioning monitoring solutions (with Web interfaces, click, click, click all day, doing the same drudge work).

Now that we've moved to automated provisioning, it makes sense to handle monitoring as much as possible from that same angle. Ideally, we want monitoring to set itself up from the same machine-readable descriptor that will set up the actual infrastructure to monitor, before that infrastructure exists. We call this "test-driven infrastructure", by analogy with test-driven development in the XP methodology: you write a test (though the information is already largely in your Puppet or other description), monitoring starts, and the infrastructure and everything it needs to support appears in the namespace and on the screens, with all states in ALERT because nothing is working yet. Then, as the VMs, OS, and services appear, states go to OK. At the end, you know your system is OK because everything is green, just as with TDD you know your code is OK because all your tests are green.
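
To make the idea concrete, here is a minimal Python sketch of monitoring that provisions itself from a descriptor. Everything in it is an assumption for illustration: the descriptor layout, the host and service names, and the dotted state names are invented and do not reflect ExtreMon's actual configuration or namespace.

```python
from typing import Dict, List

# Hypothetical machine-readable descriptor: host -> services expected on it.
# In practice this information would come from Puppet or a similar tool.
DESCRIPTOR: Dict[str, List[str]] = {
    "appsrvweb001": ["os", "httpd"],
    "db001": ["os", "database"],
}


def provision_monitoring(descriptor: Dict[str, List[str]]) -> Dict[str, str]:
    """Create one state per expected service, all in ALERT,
    before any of the infrastructure exists."""
    states: Dict[str, str] = {}
    for host, services in descriptor.items():
        for service in services:
            states[f"{host}.{service}.state"] = "ALERT"
    return states


def all_green(states: Dict[str, str]) -> bool:
    """The system counts as OK only when there are states and all are OK."""
    return bool(states) and all(state == "OK" for state in states.values())


if __name__ == "__main__":
    states = provision_monitoring(DESCRIPTOR)
    print(all_green(states))          # nothing is up yet
    for name in states:               # VMs, OS and services appear one by one
        states[name] = "OK"
    print(all_green(states))
```

The design choice that matters is the order: the states exist, in ALERT, before anything is deployed, so a check that never turns green is visible instead of silently missing.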

We haven't done much on this side of the equation. Many of the ExtreMon configs are still hard-coded object graphs (designed to be instantiated from textual descriptors, but this is not yet implemented), so there is still a lot of manual provisioning in there.

Graphing

We don't have anything of our own, and we don't want to reinvent the wheel. We've connected a carbon engine to the cauldron, and it happily keeps track of and graphs our 5K metrics/sec (though we had to do some tuning). Ideally, we should get our display code to display those graphs.
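
For readers who have not fed carbon before, the sketch below shows one common way to do it, assuming "carbon" here means Graphite's carbon daemon with its plaintext listener (one `path value timestamp` line per metric, port 2003 by default). How ExtreMon actually hooks carbon to the cauldron is not described in this post; the function names, metric path and host below are hypothetical.

```python
import socket
from typing import Iterable


def carbon_line(path: str, value: float, timestamp: int) -> bytes:
    """Format one metric in carbon's plaintext protocol:
    '<metric path> <value> <unix timestamp>' followed by a newline."""
    return f"{path} {value} {timestamp}\n".encode("ascii")


def send_to_carbon(lines: Iterable[bytes],
                   host: str = "localhost", port: int = 2003) -> None:
    """Send a batch of formatted lines over a single TCP connection.
    Host and port are assumptions; 2003 is carbon's default plaintext port."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"".join(lines))


if __name__ == "__main__":
    # Hypothetical metric name, printed rather than sent.
    print(carbon_line("example.appsrvweb001.cpu.idle", 93.5, 1328282520))
```

Batching many lines per connection, as `send_to_carbon` does, is one plausible way to keep up with a few thousand metrics per second, though the tuning mentioned above may have been something else entirely.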

Schematic Overview

Rough Overview

The Code So Far

I'm consolidating our 4 private GitHub repos into new, public ones, by or at the #monitoringsux hackathon.

Done: https://github.com/m4rienf/ExtreMon-Display
Done: https://github.com/m4rienf/ExtreMon
ToDo: Koen's JavaScript clients, Java namespace browser applet

03 Feb 2012 3:22pm GMT

Guy Van Sanden: Why does the upgrade-manager in precise insist on removing skype?

After upgrading to Precise, I noticed that Skype had been uninstalled. That was easily fixed by downloading the deb from Skype's site.

But now, at each update via update-manager, it says the skype package should have been removed and that I need to remove it before proceeding.

Is this a bug? Any workaround?

03 Feb 2012 9:39am GMT

FOSDEM organizers: Friday build-up

FOSDEM is almost upon us.

We will begin building up the ULB campus on Friday at 13:00. If you are around and want to help out, do join us!

Most of the work could be finished by 18:00. If you are hesitating to join in the late afternoon, check this post to see whether help is still needed.

03 Feb 2012 12:09am GMT

02 Feb 2012

Frank Goossens: Is Lana del Rey a Meat Puppet?

The Meat Puppets wrote it, but Nirvana stole the show with it:

http://youtu.be/zh1lce1PwmY

And Lana Del Rey is a meat puppet too, just listen:

http://youtu.be/HO1OV5B_JDw

Hearing vague similarities like this may be a minor quirk of mine, but … seriously, Lana?

02 Feb 2012 5:21pm GMT