12 Nov 2011
Planet Apache
Simone Tripodi: An open source Pizza!
As a typical Italian, I consider food part of my culture; I love eating as much as I love cooking. Please don't think we are crazy!
The Mediterranean Diet is NOT a kind of religion but IMHO a very good guideline that lets people eat good, tasty food without much risk of getting fat - but please pay attention: if you eat 1 kg of pasta each day, you will :)
Pizza is, as everybody knows, one of the most typical Italian foods - where there is pizza, there is always a little part of Italy around!
You may be wondering what this has to do with OpenSource development... my foreign friends may think that Italian cooking needs a lot of secrets to be so tasty, but... this is false! Italian cooking is simple, pragmatic and, above all, based on healthy ingredients. So let's "open the source" of a good Italian pizza!
Preface: this technique aims to show how to make a highly digestible pizza, so it requires at least 48h of rising!
So, let's first choose the ingredients (pizza for 4):
* 400g of wheat flour 00 (optionally, 10% of it can be whole wheat flour);
* 320g of water;
* ~3g of dry yeast;
* 10g of salt;
* 16g of olive oil;
Once you have collected the ingredients, you are ready for mixing! Let's see how a professional pizzaiolo does it:
Once completed, let the mix rest for 10 minutes, then you can start applying the technique called "regenerate" 5 times, once every 10 minutes:
Once done, cover the pizza as shown in the video, put it in the lower part of the refrigerator and say goodbye for 48h :)
OK, 48h have passed, you are now hungry enough and ready to cook it :) Don't forget to take the pizza out of the refrigerator 4-5h before cooking it.
When ready to cook, let's stretch it out first:
Then cook and spice it:
Do you see how easy it is? That's the way we do it at home. Do you want to see my results? Visit my Apache Pizza album on Picasa!
Have a nice OpenSource pizza, buon appetito!
12 Nov 2011 2:56pm GMT
Jean-Baptiste Onofré: ApacheCon NA11 2011: so great
Sad, too early. You know that feeling when something great has ended.
I was at the ApacheCon NA11, Vancouver this week, and it was simply awesome.
I gave two talks:
- Apache ServiceMix future (http://people.apache.org/~jbonofre/smx_future.ppt)
- Deployment with Apache Karaf and ACE (http://people.apache.org/~jbonofre/deploy_karaf_ace.ppt)
I also attended the following sessions:
- Business and Open Source, open discussion with especially Bertrand (Delacretaz), Debbie (Moynihan), Ross (Turk)
- Apache, Branding and Trademark, with Shane (Curcuru)
- Archiva by Brett (Porter)
- Whirr by Tom (White)
- DOSGi and cloud by Guillaume (Nodet)
I talked with a lot of people:
- Of course Dan (Kulp), Hadrian (Zbarcea), Ross (Turk)
- Guillaume (Nodet): it was really great to work together and to discuss our roadmap for Karaf and ServiceMix, etc. Thanks a bunch, Guillaume
- Christian (Muller) and Jon (Anstey): we talked about Camel. These guys are so cool
- Mohammad (Nour El-Din) about projects in the incubator, proposals, etc
- Brett (Porter) and Carlos (Sanchez) about Archiva
- Marcel (Offermans), Alex, Bram, about ACE and OSGi
- lots of others (Bertrand, Shane, etc.) whom I would like to thank too
Once again, thank you all.
See you in November 2012 at ApacheCon Europe in Germany.
12 Nov 2011 1:08pm GMT
Claus Ibsen: Apache Camel 2.9.0-RC1 Released
The Camel team is working hard on the last pieces for the upcoming Apache Camel 2.9.0 release. In the meantime we decided to cut a release candidate because of some larger changes, such as core API refactorings, Spring dependency changes, a rewritten simple expression language, etc.
We would highly appreciate any feedback from the community in terms of any upgrade glitches, or other issues discovered in the release candidate.
The release is available for download from Apache, as well as from the Maven Central repository.
For the release notes, we suggest taking a look at the current in-progress release notes for the 2.9.0 release.
12 Nov 2011 10:48am GMT
James Duncan: Flying rhinos
How do you move a Rhino? Check out how the WWF is using helicopters as part of their toolset to move rhinos as part of their Black Rhino Range Expansion Project.
Linked by James Duncan Davidson.
12 Nov 2011 3:48am GMT
11 Nov 2011
Justin Mason: Links for 2011-11-11
- Determining response times with tcprstat: 'Tcprstat is a free, open-source TCP analysis tool that watches network traffic and computes the delay between requests and responses. From this it derives response-time statistics and prints them out.' Computes percentiles, too
(tags: tcp tcprstat tcp-ip networking measurement statistics performance instrumentation linux unix tools cli)
11 Nov 2011 11:58pm GMT
Bryan Pendleton: Now THAT'S a data center!
Here's a fun story about the Switch Communications "SUPERNAP" data center in Nevada. Switch claims it is "the world's best data center" and they have the stats to justify their claim.
These Internet-scale datacenters have really taken off in recent years. Last month the Open Compute community held their second Open Compute Summit, and part of that effort was the establishment of a foundation to guide the work as it moves forward; read more about that effort here. I haven't seen too much technical information flowing from the Open Compute Summit, although James Hamilton of Amazon posted his slides online here.
Meanwhile (was this part of the summit, or independent?), the team at AnandTech have done some independent testing of the Open Compute server components; in their conclusion, they commend the Open Compute work as showing tremendous potential:
The Facebook Open Compute servers have made quite an impression on us. Remember, this is Facebook's first attempt to build a cloud server! This server uses very little power when running at low load (see our idle numbers) and offers slightly better performance while consuming less energy than one of the best general purpose servers on the market. The power supply power factor is also top notch, resulting in even more savings (e.g. power factoring correction) in the data center.
While it's possible to look at the Open Compute servers as a "Cloud only" solution, we imagine anyone with quite a few load-balanced web servers will be interested in the hardware. So far only Cloud / hyperscale data center oriented players like Rackspace have picked up the Open Compute idea, but a lot of other people could benefit from buying these kind of "keep it simple" servers in smaller quantities.
Lastly, since much of the activity in this area of computing has to do with power efficiency, let me draw your attention to this interesting work on power management in Android.
Cheaper, faster, and more power-efficient: the future of computing beckons!
11 Nov 2011 10:48pm GMT
FeatherCast: ApacheCon NA 2011 – Thursday
Here's the audio from all (well, most, anyway) of the tracks on Thursday at ApacheCon North America 2011 in Vancouver, British Columbia, Canada.
Thanks go to Hugh Brown and Bertrand Delacretaz for their assistance in editing these tracks, so that it didn't take me all day today.
Plenaries
Fast Feather
Track A - Enterprise Java
- Apache ServiceMix future - Jean-Baptiste Onofré
- An architecture for enabling multi-tenancy for Apache Axis2 - Afkham Azeez
- An Interactive Example of Enterprise SOA, Apache Style - Hadrian Zbarcea
- ActiveMQ In Action: Common Problems and Solutions - Bruce Snyder
- Security Problems (and Solutions) for Service Oriented Applications - Daniel Kulp
Track B - Data Handling - Big Data
- The Past, Present and Future of NOSQL - Emil Eifrem
- Scaling Hadoop Applications - Milind Bhandarkar
- The other Apache technologies your big data solution needs - Nick Burch
- Apache Mahout for intelligent data analysis - Isabel Drost
- Dr. Mahout: Analyzing clinical data using scalable and distributed computing - Shannon Quinn
Track C - Community
- The Secret Life of Open Source - Ted Husted
- Talking people into creating patches - Isabel Drost
- Navigating the Apache Incubator - Brett Porter
- Life in Open Source communities - Bertrand Delacretaz
- Chefs with Feathers: The Sakai Project - Carl Hall
Track E - Servers (HTTPD)
- Apache httpd 2.4: The Web Server for the Cloud - Jim Jagielski
- mod_lua for beginners - Eric Covener
- The Power of the mod_proxy Modules - Paul Weinstein
- Hardening Enterprise Apache Installations Against Attacks - Sander Temme
- Out and About with Apache Traffic Server - Leif Hedstrom
Track F - Content Technologies
- If you have the content, Apache has the technology - Nick Burch
- Apache Tika: 1 point Oh! - Chris Mattmann
- ManifoldCF for Content Acquisition - Karl Wright
- Interoperability with CMIS and Apache Chemistry - Florian Müller
- Handling RDF data with Apache Jena - Paolo Castagna
11 Nov 2011 10:32pm GMT
Christian Grobmeier: Dart on the W-JAX 2011
Thanks to my friends from Software & Support Media I was invited to W-JAX 2011 to speak about Dart. This was very exciting and a great experience! The room was full, with ~70 people, most of them Java developers. We talked about the current state, some syntax and the future of Dart. In the end I got the impression that there is a huge interest in Dart. After the show, I was asked to give a short interview for Jaxenter. Here we go (in Bavarian German).
Of course some of you won't be able to understand German, so I have tried to translate the important parts for you. Jaxenter's questions are marked in bold. As always, feedback is welcome!
You just gave a session on Google Dart - how would you characterize Dart?
Dart is a scripting language from Google whose goal is to replace JavaScript one day. This sounds pretty arrogant at first glance. But if you look at Dart a bit more closely, you'll see it is a nice, well-designed language. I think it is a scripting language that is especially approachable for Java developers. It offers server-side Java developers a chance to create cool frontends, which is usually done by JavaScript people only.
Which features are most interesting?
First, there are classes and interfaces. Many Java developers probably always wanted classes in JavaScript.
Then there are Isolates. Isolates are like Java Threads. In JavaScript there is usually no option to work with multithreading. In Dart this is different: Isolates work very well already.
Then there is a VM. You can write server applications with Dart. Now Browser- and Server-developers can use a single language.
What does the type system look like?
There are Types in Dart - but they are optional. You can use Types for debugging or for documentation purposes. But you need to enable Types on program launch. If you don't activate them you can use wrong types and Dart will not fail.
At the moment Dart is a preview version - what are the problems currently?
It's running very well - you can't say there are problems. Still, the community is small. There is feedback - but there could be more. The foundation of Dart is there - but important features like database access or file access are still in development - and this takes time.
Five people are named on the committers list - probably there are some Google-internal devs too. But anyway, this is a small team. Now they must decide based on the community's feedback and work on the important features.
Dart has been announced as an open language - what is the reality? You said 5 people are programming it, so how open is it actually?
Yes, Dart is Open Source in the sense that the source code is open. You can look at the issues or participate in the project on the mailing list. But it is not open development. This means many decisions are taken internally. Maybe those decisions are based on community feedback, but in the end the decision votes are not open to all.
Unfortunately many fundamental questions, like "where are we next year?" or "what's going on with browser plugins?", are discussed internally too.
Therefore you need to take care.
Probably there will be a development model like the one at the Apache Software Foundation.
Or maybe there will be a JCP-like process, as Oracle has with Java.
Or maybe it will stay as it is now, and the community can only provide patches.
My big hope is it will be like at The Apache Software Foundation.
Dart wants to be a replacement for JavaScript - do you see a chance here? After all, JavaScript is very widely used.
Of course, JavaScript is being used by many. But yes, I see chances. JavaScript developers love their language. Google's concepts, like classes or interfaces, draw a lot of criticism from them. Somebody even said design patterns are so 90s. On the other hand, imagine you can do the things you usually do with JavaScript, just with a language which looks like Java. This is very appealing. If you look at the way Dart deals with HTML, one could say Dart is a cleaned-up version of JavaScript.
And of course there is a big company behind Dart, which has a strong interest in its success. Of course the success depends on the community. But I really think that there might be some Google products which will be rewritten on top of Dart, once Dart is stable.
There are rumours that Google is targeting Android as a platform…
That's right, everybody seems to say that. If you look at the constellation Apache/Google/Oracle and the project Apache Harmony, then you can imagine that Google is a bit tired of that kind of story. Dart has a VM - this is a huge opportunity. Dart on Android would push the language very much. After all, if one used Dart on Android, he would probably like to use it on his websites too.
But there are more rumours. Some people say Dart might replace GWT one day.
And just a few minutes ago I heard that Dart would do very well on Google App Engine.
But details are unknown…
Correct.
Is there some kind of roadmap, milestones anything?
"We work very hard on it" - that's what is being said. There are many requests about the HTML library, which lets you work very comfortably with HTML. It has been in the making for a while now, but even for that there is no official release date. They are working hard but do not speak about release dates.
11 Nov 2011 8:20pm GMT
James Duncan: Jason Snell explains SIM unlocks for US CDMA iPhones
Macworld's Jason Snell reports (again) on the latest when it comes to iPhone 4S SIM unlocks on Verizon and Sprint. Between this and the fact that AT&T doesn't cut any discount for using an unsubsidized iPhone, it seems that the CDMA carriers are a better option for world travelers.
Linked by James Duncan Davidson.
11 Nov 2011 7:22pm GMT
James Duncan: Unlocked iPhone 4S shipping in 1-2 weeks
The unlocked iPhones are now available at the Apple Store and are shipping in 1-2 weeks. As attractive as this is for heavy international travelers, it's too bad AT&T doesn't seem to have plans that are cheaper if you're not paying the subsidy.
Linked by James Duncan Davidson.
11 Nov 2011 6:43pm GMT
Mukul Gandhi: XPath 2.0 and XSD schemas : sharing experiences
I was just playing with XPath 2.0 and thought of sharing my observations, about a specific use case.
We start with the following XSD schema document,
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="X">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="a" type="xs:integer"/>
      </xs:sequence>
      <xs:attribute name="att1" type="xs:boolean"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
This schema is intended to validate an XML instance document like the following,
<X att1="0">
<a>100</a>
</X>
I wrote the following XPath (2.0) expression [1],
/X[if (@att1) then true() else false()]/a/text()
and ran it after enabling validation of the input document.
I thought that this would not return any result (i.e. an empty sequence).
But the XPath expression above ([1]) returns the result "100". At first, I was a little surprised by this result. I thought that, since attribute "att1" was declared with type xs:boolean in the schema, the "if condition" should return 'false' in this case. But that's not the correct interpretation of the XPath expression written above ([1]). Following is a little more explanation about this.
The reference @att1 in the XPath expression above (i.e if (@att1) ..) is a node reference (an attribute node) and is not a boolean value (which I thought initially, and I was wrong -- I incorrectly thought, that atomization of the expression @att1 would take place in this case; more about this below).
The XPath 2.0 spec says that if the first item in a sequence is a node, then the effective boolean value of that sequence is 'true' (this holds whether or not the input XML document was validated with the XSD schema). In an expression like the one above (i.e. if (@att1) ..), the effective boolean value of the sequence {@att1} determines whether the "if condition" returns 'true'. Here the sequence has one item, an attribute node named "att1", so its effective boolean value is 'true' and the XPath predicate evaluates to 'true'. I think this explains why the "if condition" {if (@att1)} returns true for the above XML instance document (even if it was validated by the schema given above, and the XPath 2.0 expression above [1] was run in a schema aware mode).
To write the XPath expression correctly, as I wanted (i.e the expression of the "if condition" should return 'true' if the instance document had value true/1 for the attribute, and 'false' otherwise AND an XSD validation of instance document took place prior to the evaluation of the XPath expression), the XPath expression would need to be modified to either of the following styles [2],
OR
To understand why the expressions given above ([2]) work correctly, one needs to understand the XPath 2.0 "data" function (for the first correct variant above, [2] -- this returns the typed value of the argument of the "data" function) and the process of atomization (for the second correct variant above, [2] -- in this case the attribute node "att1" is atomized to return a sequence of kind {xs:boolean}) as described by the XPath 2.0 spec.
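To make the difference between "a node is present" and "the node's typed value is true" concrete, here is a small Python model. It is only a sketch under loud assumptions: AttributeNode, typed_value and effective_boolean_value are hypothetical helpers written for this illustration, not part of any XPath processor, and only the cases discussed in this post are modelled.

```python
from dataclasses import dataclass


@dataclass
class AttributeNode:
    """Toy stand-in for a schema-validated attribute node (hypothetical)."""
    name: str
    lexical_value: str  # text as written in the instance document, e.g. "0"
    schema_type: str    # type assigned by validation, e.g. "xs:boolean"


def typed_value(node):
    """Atomize a node: return its typed value, roughly what data() yields."""
    if node.schema_type == "xs:boolean":
        return node.lexical_value in ("true", "1")
    return node.lexical_value


def effective_boolean_value(sequence):
    """Simplified effective boolean value, limited to the cases in this post."""
    if not sequence:
        return False                  # empty sequence
    first = sequence[0]
    if isinstance(first, AttributeNode):
        return True                   # first item is a node: always true
    if len(sequence) == 1 and isinstance(first, bool):
        return first                  # a single xs:boolean is used as-is
    raise TypeError("case not modelled in this sketch")


att1 = AttributeNode("att1", "0", "xs:boolean")
node_test = effective_boolean_value([att1])                # like if (@att1)
value_test = effective_boolean_value([typed_value(att1)])  # typed value instead
print(node_test, value_test)
```

The first call sees a node and answers true regardless of its content, which is the surprise described above; the second call sees the xs:boolean typed value and answers false for att1="0".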
That's all about this. I hope that my experience with this may be helpful to someone (to understand this, one just has to know the XPath [2.0] spec correctly, and how it interacts with XSD schemas!).
Thanks for reading this post.
@2011-11-11: updated in place, to correct a few factual errors.
11 Nov 2011 4:07pm GMT
Christian Schneider: Hot Standby failover for Apache Camel routes
Blog post edited by Christian Schneider
In enterprise environments a typical requirement is that an integration has to be highly available. Typically you will use at least two nodes to achieve that. Depending on the requirements you will either want all nodes to be active or only one. The problem with having more than one active node is that messages can get out of order. So if your requirement is that your messages keep in sequence then sometimes the only way to achieve that is to make sure only one node is active at any time.
By default Apache Camel has no mechanism to achieve this. Since some of my customers had this requirement, I decided to create an add-on for Apache Camel that provides it.
SimpleCluster
The idea is to use a database table lock to synchronize the locking between the nodes. The reason is that databases are really good at such things and so a database lock is really very reliable. As I did not want to reinvent the wheel I started with the database lock code from ActiveMQ (http://activemq.apache.org/jdbc-master-slave.html). Basically the idea is to do a "select * from mytable for update" in a transaction. This locks the table so only one node can acquire the lock.
This is encapsulated in the class DbLockManager. The interface FailoverHandler then lets you register a callback in your own code to be notified that you should start or stop. This part is completely independent of Apache Camel and can also be used for other use cases.
Behaviour
The node that is started first will acquire the lock and start the route. All nodes will try to get the lock after the sleep interval. If the lock cannot be acquired, the db call blocks until the transaction timeout is reached. So the interval between two tries to get the lock is a little larger than the sleep interval. In case of a connection or db failure the node will stop. So if the DB goes down all nodes will stop. That means you should make sure the DB is also HA.
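The behaviour can be played through with a small Python sketch. It is not the simplecluster code: a threading.Lock stands in for the database table lock, and InProcessLockManager and StandbyHandler are hypothetical names invented for this illustration. A real setup would hold the lock through an open database transaction instead.

```python
import threading


class StandbyHandler:
    """Hypothetical callback object: start or stop the protected work."""

    def __init__(self, name):
        self.name = name
        self.active = False

    def start(self):
        self.active = True

    def stop(self):
        self.active = False


class InProcessLockManager:
    """One node's view of the shared lock (stand-in for the DB table lock)."""

    def __init__(self, shared_lock, handler, lock_timeout=0.05):
        self.shared_lock = shared_lock
        self.handler = handler
        self.lock_timeout = lock_timeout  # plays the role of the transaction timeout
        self.holds_lock = False

    def try_acquire(self):
        """One polling attempt; blocks for at most lock_timeout seconds."""
        if self.holds_lock:
            return True
        if self.shared_lock.acquire(timeout=self.lock_timeout):
            self.holds_lock = True
            self.handler.start()
        return self.holds_lock

    def shutdown(self):
        """Node stops or loses its connection: stop the work, free the lock."""
        if self.holds_lock:
            self.handler.stop()
            self.holds_lock = False
            self.shared_lock.release()


table_lock = threading.Lock()
node_a = InProcessLockManager(table_lock, StandbyHandler("a"))
node_b = InProcessLockManager(table_lock, StandbyHandler("b"))

node_a.try_acquire()  # first node wins the lock and starts its route
node_b.try_acquire()  # second node times out and stays passive
node_a.shutdown()     # the active node goes away
node_b.try_acquire()  # the standby takes over on its next attempt
```

In a real deployment each node would call try_acquire in a loop with a sleep in between, which is why the effective retry interval is the sleep time plus the lock timeout.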
Configuration
The DbLockManager is configured like this:
<bean id="lockManager" init-method="start" destroy-method="stop">
  <property name="dataSource" ref="dataSource" />
  <property name="handler" ref="failoverPolicy" />
</bean>
You can also set the sleep time and the lock table name.
The Camel integration is done with a RoutePolicy. Such a policy can be easily added to a camel route and can control the status of the route. So the FailoverRoutePolicy simply needs to be added as a bean:
<bean id="failoverPolicy" class="net.lr.simplecluster.example.FailoverRoutePolicy"/>
So the only thing that remains is to add the policy to the route and to make sure the route does not start on its own:
from("file:target/test").noAutoStartup().routePolicyRef("failoverPolicy").to("log:test2");
Code
11 Nov 2011 2:42pm GMT
James Duncan: The Art of Shifting Costs Elsewhere
When I was in architecture school in the 90's, one of the things we studied in our design studio was how the economic landscape of small towns had changed and how those changes affected the built environment. One of the prime forces we studied was Walmart.
Walmart established itself in rural Oklahoma (among other places, but Oklahoma is the territory we studied) over the course of multiple waves. One wave was to build medium size stores at the edge of small towns where land was cheap. Their selection and low prices pulled a lot of people toward them and away from small local main street shops. After a significant amount of the local competition was shuttered, the next wave kicked in. Walmart consolidated four or five local stores into one larger store and abandoned the others in the area.
The obvious rationale is that building fewer larger stores was more efficient for Walmart to run logistically and that customers benefited by having a larger selection of even cheaper products to buy. The only catch? Most of them had to drive further since the Walmart wasn't in their town and most of their local stores were closed down. The local stores that remained open were simply more expensive to shop in.
Now, we were studying architecture, not economics, but it was easy to hypothesize that part of the efficiency gain for Walmart is that they had successfully shifted part of the cost of transporting goods in two ways. First, government had to pay more to build all the roads, highways, and infrastructure needed for everyone to drive to and from those Walmarts. Second, customers had to pay for the gas and maintenance on the vehicles they used to go back and forth.
To bastardize a telecom analogy, if this is true, it's a successful shift of the last ten or twenty mile cost of selling goods. Never mind that the governmental cost for building all those roads and the individual costs of consumers buying gas and maintaining cars is probably much larger in aggregate than the savings. Walmart didn't have to pay it, the perceived prices were lower, and it's as easy to ignore the rest as it is to let spare change build up in your sock drawer.
If it was in any way intentional, it's brilliant albeit in an evil way.
I've been thinking about that a lot lately. I'd love to find a good solid study on that sometime again and read up on it. The idea of shifting costs to a nebulous "elsewhere" also seems to apply pretty well to my current thoughts about the US Patent Office.
I've heard from multiple sources that patent examiners only really search for prior art in the existing patent library. If it never was patented, it must not exist. It's about the only explanation for how patents like the method for swinging on a swing get through.
Now, I get that the patent office is swamped, there's too much work for the patent examiners, and the fees for patent applications don't cover the costs of running the office. But, taking a shortcut approach and only searching within the set of issued patents means that they're shifting (intentionally or not) the cost of rejecting applications before issuance to the people and corporations that pay to defend themselves by challenging the issued patents. At least those that are able to pay without much consequence to their bottom line.
It's fascinating and maddening all at once.
Posted by James Duncan Davidson.
11 Nov 2011 8:02am GMT
Antoine Toulme: Getting help spelling words - Introducing the spell project

When I finally received my credit card after spending an hour spelling every other identifier I had, from my mother's maiden name to my email address to my SSN,
I was really frustrated to see the cards come in with the wrong name and a facetious first name.
I could easily tell my French accent and the bad quality of the line were to blame for those results.
Determined to get this right, I called back from a better spot in a silent room. This time, the cards came in ok - but my email address came in mangled.
I decided to look into phonetics to spell my name correctly, and being lazy, I looked at how to never have to think about it again.
I picked up the say command-line tool on the Mac. It says my name fine, and the -o option outputs an AIFF file for it that I could easily convert to WAV.
That solves the French accent issue. But I wanted to make double sure the guy on the other end of the line would get this right.
Next up, I took the NATO alphabet and output each letter into a separate file. I converted all the files to WAV.
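The lookup behind this step is tiny. Here is a hedged Python sketch of it (the project itself is not written in Python, and spell() is a hypothetical helper for this illustration). Skipping characters that have no code word is an assumption of the sketch, not necessarily what the project does.

```python
import string

# NATO spelling alphabet, one code word per letter, in alphabetical order.
CODE_WORDS = [
    "Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel",
    "India", "Juliett", "Kilo", "Lima", "Mike", "November", "Oscar", "Papa",
    "Quebec", "Romeo", "Sierra", "Tango", "Uniform", "Victor", "Whiskey",
    "X-ray", "Yankee", "Zulu",
]
NATO = dict(zip(string.ascii_lowercase, CODE_WORDS))


def spell(text):
    """Return the code words for the letters of text, skipping anything else."""
    return [NATO[ch] for ch in text.lower() if ch in NATO]


print(spell("Ant"))
```

Each code word would then map to one recorded sound file, so the list returned by spell() is also the playback order.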
Bringing it on the interwebs
I created a simple html file that went like this:
The HTML loads up all the WAV files using audio tags, and you play them in the order given by the string you enter in the text input.
The catch in there - and you see the code trying desperately to make up for it - is that the play() method of the audio tag triggers playback but returns right away. It's difficult to detect when playback has completed (I didn't find a way to register an event handler).
So the code does a dodgy thing: it waits the length of the sound, plus an extra 200 milliseconds, then plays the next letter.
If there's a way to listen for the end of playback on the audio tag, I would definitely love feedback!
Server-side ruby
So the next best thing to try: put Ruby to work and see how it could combine the sounds and stream them out.
I stumbled upon this most excellent blog post that deals with WAV. The goal of the author was to analyze WAV files; it explained how to decompose the binary format of a WAV file, and the last piece in Ruby showed how to use unpack to read a WAV file with a one-liner.
data.unpack 'Z4 i Z8 i s s i i s s Z4 i s*'
I took this Ruby for a spin and I managed to read my wav files and even combine them on the fly.
I put together some minimal code to load all the sounds available in memory, then combine them into the order dictated by the sentence required.
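For readers who want to try the same idea without Ruby, here is a rough Python equivalent. This is a swapped-in technique, not the project's code: it uses the standard wave module to join clips, and struct to read the same 44-byte header layout that the unpack one-liner above decodes. make_tone_wav and concat_wavs are hypothetical helpers, and the generated clips are stand-ins for the recorded letters.

```python
import io
import struct
import wave


def make_tone_wav(n_frames, sample_value, rate=8000):
    """Build a tiny mono 16-bit WAV in memory (stand-in for a letter clip)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack("<%dh" % n_frames, *([sample_value] * n_frames)))
    return buf.getvalue()


def concat_wavs(clips):
    """Join WAV byte strings that share one format into a single WAV."""
    out = io.BytesIO()
    params = None
    with wave.open(out, "wb") as w:
        for clip in clips:
            with wave.open(io.BytesIO(clip), "rb") as r:
                if params is None:
                    params = r.getparams()
                    w.setparams(params)
                elif r.getparams()[:3] != params[:3]:
                    raise ValueError("clips must share channels, width and rate")
                w.writeframes(r.readframes(r.getnframes()))
    return out.getvalue()


combined = concat_wavs([make_tone_wav(100, 1000), make_tone_wav(50, -1000)])

# Same field order as the Ruby format string: RIFF tag, size, "WAVEfmt ",
# fmt chunk size, format, channels, rate, byte rate, block align, bits,
# "data" tag, data size.
header = struct.unpack("<4si8sihhiihh4si", combined[:44])
print(header)
```

The wave module rewrites the size fields in the header when the output is closed, so the combined file stays a valid WAV without any manual byte patching.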
By placing it behind a Sinatra app, it should be easy to serve sounds based on strings.
The code is licensed under the MIT license and is available on GitHub.
Being a weekend ramble, the code comes with no tests, lacks support for numbers, and god forbid you use a multi-byte character!
Please feel free to fork it and have fun with it.
11 Nov 2011 5:00am GMT
10 Nov 2011
Bryan Pendleton: Coarse language in professional writing
Scott Hanselman shares his feelings about the use of coarse language.
Zach Holman responds.
Ted Dziuba says the real issue is passion and honesty versus marketing and publishing.
I can see both sides. Nothing about the language that Holman uses (or that Heinemeier Hansson does, for that matter) gets under my skin; perhaps I'm just thicker-skinned than many.
But meanwhile I know many readers who are put off by such things.
And so in my own writing I do my best to avoid such.
But I agree that you should (a) write about what you care about, and (b) write in your own words, not in the words that you think others want you to speak.
I guess I'm not adding much to the conversation, but there you go: a few pointers to some interesting articles and a bit of an observation by me.
10 Nov 2011 2:45pm GMT
Michael McCandless: SearcherLifetimeManager prevents a broken search user experience
In the past, search indices were usually very static: you built them once, called optimize at the end and shipped them off, and didn't change them very often.
But these days it's just the opposite: most applications have very dynamic indices, constantly being updated with a stream of changes, and you never call optimize anymore.
Lucene's near-real-time search, especially with recent improvements including manager classes to handle the tricky complexities of sharing searchers across threads, offers very fast search turnaround on index changes.
But there is a serious yet often overlooked problem with this approach. To see it, you have to put yourself in the shoes of a user. Imagine Alice comes to your site, runs a search, and is looking through the search results. Not satisfied, after a few seconds she decides to refine that first search. Perhaps she drills down on one of the nice facets you presented, or maybe she clicks to the next page, or picks a different sort criteria (any follow-on action will do). So a new search request is sent back to your server, including the first search plus the requested change (drill down, next page, change sort field, etc.).
How do you handle this follow-on search request? Just pull the latest and greatest searcher from your SearcherManager or NRTManager and search away, right?
Wrong!
If you do this, you risk a broken search experience for Alice, because the new searcher may be different from the original searcher used for Alice's first search request. The differences could be substantial, if you had just opened a new searcher after updating a bunch of documents. This means the results of Alice's follow-on search may have shifted: facet counts are now off, hits are sorted differently so some hits may be duplicated on the second page, or may be lost (if they moved from page 2 to page 1), etc. If you use the new (will be in Lucene 3.5.0) searchAfter API, for efficient paging, the risk is even greater!
Perversely, the frequent searcher reopening that you thought provides such a great user experience by making all search results so fresh, can in fact have just the opposite effect. Each reopen risks breaking all current searches in your application; the more active your site, the more searches you might break!
It's deadly to intentionally break a user's search experience: they will (correctly) conclude your search is buggy, eroding their trust, and then take their business to your competition.
It turns out, this is easy to fix! Instead of pulling the latest searcher for every incoming search request, you should try to pull the same searcher used for the initial search request in the session. This way all follow-on searches see exactly the same index.
Fortunately, there's a new class coming in Lucene 3.5.0, that simplifies this: SearcherLifetimeManager. The class is agnostic to how you obtain the fresh searchers (i.e., SearcherManager, NRTManager, or your own custom source) used for an initial search. Just like Lucene's other manager classes, SearcherLifetimeManager is very easy to use. Create the manager once, up front:
SearcherLifetimeManager mgr = new SearcherLifetimeManager();
Then, when a search request arrives, if it's an initial (not follow-on) search, obtain the most current searcher in the usual way, but then record this searcher:
long token = mgr.record(searcher);
The returned token uniquely identifies the specific searcher; you must save it somewhere in the user's search results, for example by placing it in a hidden HTML form field.
Later, when the user performs a follow-on search request, make sure the original token is sent back to the server, and then use it to obtain the same searcher:
// If possible, obtain same searcher version as last
// search:
IndexSearcher searcher = mgr.acquire(token);
if (searcher != null) {
  // Searcher is still here
  try {
    // do searching...
  } finally {
    mgr.release(searcher);
    // Do not use searcher after this!
    searcher = null;
  }
} else {
  // Searcher was pruned -- notify user session timed
  // out
}
As long as the original searcher is still available, the manager will return it to you; be sure to release that searcher (ideally in a finally clause).
It's possible the searcher is no longer available: for example, if Alice ran a new search, but then got hungry, went off to a long lunch, and finally returned and clicked "next page", the original searcher will likely have been pruned!
You should gracefully handle this case, for example by notifying Alice that the search had timed out and asking her to re-submit the original search (which will then get the latest and greatest searcher). Fortunately, you can reduce how often this happens, by controlling how aggressively you prune old searchers:
mgr.prune(new PruneByAge(600.0));
This removes any searchers older than 10 minutes (you can also implement a custom pruning strategy). You should call it from a separate dedicated thread (not a searcher thread), ideally the same thread that's periodically indexing changes and opening new searchers.
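To see the record / acquire / prune life cycle end to end without a Lucene index, here is a toy Python model. It is only a conceptual sketch: ToySearcherLifetimeManager is a hypothetical class, it leaves out reference counting and release entirely, and it measures age from the moment a searcher was recorded, which is a simplification of whatever the real class does.

```python
import itertools


class ToySearcherLifetimeManager:
    """Toy model: remembers searchers by token and prunes them by age."""

    def __init__(self, clock):
        self._clock = clock             # injectable so the example is deterministic
        self._tokens = itertools.count(1)
        self._searchers = {}            # token -> (searcher, recorded_at)

    def record(self, searcher):
        """Remember a searcher and hand back the token that identifies it."""
        token = next(self._tokens)
        self._searchers[token] = (searcher, self._clock())
        return token

    def acquire(self, token):
        """Return the recorded searcher, or None if it has been pruned."""
        entry = self._searchers.get(token)
        return entry[0] if entry else None

    def prune(self, max_age_sec):
        """Drop every searcher recorded more than max_age_sec seconds ago."""
        now = self._clock()
        stale = [t for t, (_, at) in self._searchers.items() if now - at > max_age_sec]
        for t in stale:
            del self._searchers[t]


now = [0.0]                             # fake clock, in seconds
mgr = ToySearcherLifetimeManager(clock=lambda: now[0])
token = mgr.record("searcher-v1")       # Alice's initial search

now[0] = 300.0                          # five minutes later: next page
mgr.prune(600.0)
still_there = mgr.acquire(token)

now[0] = 700.0                          # back from a long lunch
mgr.prune(600.0)
after_lunch = mgr.acquire(token)
print(still_there, after_lunch)
```

The second acquire comes back empty, which is the "session timed out" branch in the Java snippet above.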
Keeping many searchers around will necessarily tie up resources (open file descriptors, RAM, index files on disk that the IndexWriter would otherwise have deleted). However, because the reopened searchers share sub-readers, the resource consumption will generally be well contained, in proportion to how many index changes occurred between each reopen. Just be sure to use NRTCachingDirectory, to ensure you don't bump up against open file descriptor limits on your operating system (this also gives a good speedup in reopen turnaround time).
Don't erode your users' trust by intentionally breaking their searches!
LUCENE-3486 has the details.
10 Nov 2011 11:32am GMT
