Sunday, December 30, 2007

Numbers in a translation dictionary

In the Wikimedia Foundation, there are inclusionists and exclusionists. Some are of the opinion that certain topics should not be included in Wikipedia, while others think they should. The most brilliant example is the inclusion of all the bus stops in Japan. Someone took the effort to describe them and people actually find them useful.

Some people are enamoured of constructed languages and spend a lot of time making such languages their own. I personally have had dealings with at least three people who speak Volapük, and I know people who go to congresses because they meet people who speak Esperanto. Many constructed languages have more speakers than many natural languages (that are not yet extinct).

For exclusionists it is not palatable when constructed languages do well. There are always "good" reasons why those others need to be excluded. Hidden in the discussion about the "radical cleanup of the Volapük Wikipedia" is a discussion about the inclusion of numerals like 588 in the Limburgian Wiktionary and as you can imagine "it is not good".

In a translation dictionary there are reasons to include numbers. The point is that they are not written the same in all scripts. OmegaWiki has its fair share of numbers, and it did not address this issue.

By adding a new class, we now allow the representation of Arabic and Roman numerals, and "hundred" now has both 100 and C as annotations. In this way we do not need a separate page for each numerical representation.
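To make the idea concrete, here is a minimal sketch of one entry carrying several numeral representations as annotations. The function names and the dictionary layout are assumptions made for illustration; they are not how OmegaWiki stores its annotations.

```python
# Hypothetical sketch: derive the numeral annotations for a single number entry.
# Value/symbol pairs for standard Roman numerals, largest first.
ROMAN_VALUES = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number):
    """Convert an integer in the range 1..3999 to a Roman numeral."""
    if not 0 < number < 4000:
        raise ValueError("standard Roman numerals cover 1..3999")
    parts = []
    for value, symbol in ROMAN_VALUES:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def numeral_annotations(number):
    """Return the representations that one entry could carry as annotations."""
    return {"arabic": str(number), "roman": to_roman(number)}


print(numeral_annotations(100))  # {'arabic': '100', 'roman': 'C'}
```

With annotations like these, "hundred" stays a single entry and 588 does not need a page per script.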

Thanks,
GerardM

Tuesday, December 18, 2007

Languages ...

Given a fixed point, you can move the world. In many ways, your language provides you with the tools to wrap your mind around the world, to express its essence in a way that may be understood by the people you communicate with, and to help you shape the world as you know it.

Language is both individual and shared. My English has been shaped by my schooling in the Netherlands, my stay in the United Kingdom and the many times I expressed myself in e-mail, articles, presentations and when skyping. Typically I get it right but a text can be understood by some and misunderstood by others. It is my English but to function it has to be expressed in a way that is shared with others.

For some subjects I prefer my native language, for others I prefer English. To communicate, the language used must be sufficiently shared by everyone involved. The language must be received; when I am surrounded by a vacuum, nobody will hear me talk. To read this blog, you either have access to a computer or someone must print it out for you.

A language lives when people use it, when it is part of a distinct community, a distinct culture. When the boundaries around such a community or culture disappear or change, the language either morphs or it dies. To understand history, you have to understand its artefacts and its language. Many languages are dying or have died, and with them we lose the history, the culture of the people that spoke those languages. They may leave their literature and inscriptions, and when enough is left, we may understand what it says. The trick will be to understand it as it was meant when the language, the culture was alive.
Thanks,
GerardM

Saturday, December 15, 2007

Who needs birthPlace

With some regularity I try to better understand the Semantic Web and associated subjects. I find it hard going but also a compelling subject. When you express the relation "Johan Cruijff" "birthPlace" "Amsterdam", it is understandable to you as a reader, but for humans it should read like "Johan Cruijff was born in Amsterdam" or "Johan Cruijff werd geboren in Amsterdam" .. This magical statement "birthPlace" can be interpreted when you know your English; otherwise it is truly for machines only.

OmegaWiki does express relations, you will find for instance that Amsterdam is the capital of the Netherlands. In essence it is expressed as a triple, but it is expressed in natural language and depending on the existence of a translation, you will read the relation in the language selected as your user preference.

How do we combine what we do with what happens elsewhere? My latest idea is based on the RDF tag "birthPlace". It is a construct that obviously needs a natural language equivalent and this is what OmegaWiki can provide. A method is needed to connect the two. In order to function, birthPlace has a very precise definition and this definition must be part of a collection of such definitions. These labels need to be linked to OmegaWiki DefinedMeanings as the identifier for an OmegaWiki collection.

To make this useful, an external application needs to call a function that provides the translation to a specified language. How to combine this with the notion of a URN I have not figured out yet.
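As an illustration only, the function such an external application might call could look like the sketch below. The lookup table, the function name and the fallback to English are all assumptions; nothing here is an existing OmegaWiki interface.

```python
# Hypothetical sketch: render a triple as a sentence in the requested language.
# The templates stand in for translations that would come from OmegaWiki.
PREDICATE_TEMPLATES = {
    "birthPlace": {
        "en": "{subject} was born in {object}",
        "nl": "{subject} werd geboren in {object}",
    },
}


def render_triple(triple, language, fallback="en"):
    """Return a human-readable sentence for a (subject, predicate, object) triple."""
    subject, predicate, obj = triple
    templates = PREDICATE_TEMPLATES.get(predicate, {})
    # Assumption: fall back to English when the requested language is missing.
    template = templates.get(language, templates.get(fallback))
    if template is None:
        # Nothing known about this predicate: show the raw, machine-style triple.
        return f"{subject} {predicate} {obj}"
    return template.format(subject=subject, object=obj)


triple = ("Johan Cruijff", "birthPlace", "Amsterdam")
print(render_triple(triple, "nl"))  # Johan Cruijff werd geboren in Amsterdam
```

The label "birthPlace" stays the stable identifier; only the template that is looked up changes with the language.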

Thanks,
GerardM

Thursday, December 13, 2007

Eastern Yiddish

Eastern Yiddish is one of the two varieties of Yiddish that have been recognised as languages in their own right in ISO-639-3 (ydd). In OmegaWiki, we now have our first 213 Expressions in this language and I am impressed with the amount of work that has gone into it; most have annotations indicating hyphenation and the pronunciation using IPA notation.

Eastern Yiddish is not one of the languages supported by MediaWiki, and the mechanism for showing localised content is connected to the language selected in the "User Preferences". Siebrand has helped me work out which files need to be changed and added. Kim helped me with doing it for the first time and now the first localisation is visible for Eastern Yiddish.

The MediaWiki localisation itself uses Yiddish as the fallback language, so the experience is pretty good for now. What Siebrand indicated is that it is possible to include languages like Eastern Yiddish in the BetaWiki. This would create stubs that are of benefit to OmegaWiki. It would prepare for the moment when people start localising in earnest.

I think it would be a good thing, but I am interested to learn what other people think.

Thanks,
GerardM

Monday, December 03, 2007

Supporting American English

American and British English are two variations of the English language. They have substantial differences, but they are sufficiently alike that they are unlikely to be mistaken for separate languages.

In OmegaWiki, it has been possible to add entries for English; this meant that either no distinction was made between the different written versions of English, or you had to specify both versions. Issues like this exist for other languages, such as Serbian and Mandarin, as well.

In the OmegaWiki user interface, languages are considered ISO-639 entities. When a DefinedMeaning for a language is part of the appropriate collection, we use the translations in our user interface. The problem is that all these linguistic entities are needed now and that they are created to make OmegaWiki work.

For the ISO-639-6 there will be issues as the codes we make, using the RFC 4646 methodology, will be replaced. It will also be interesting to learn how in the end everything will be merged together.
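To show what codes built with the RFC 4646 methodology look like in practice, here is a small sketch. The tags "en-US" and "en-GB" are well-formed examples of the language-plus-region pattern, but which tags OmegaWiki actually uses is not stated here, so treat the data and the function names as assumptions. The fallback is a simplified truncation and not the full lookup algorithm of RFC 4647.

```python
# Hypothetical sketch: pick the best entry for a language tag by truncating the tag.
def fallback_chain(tag):
    """Return the tag followed by its progressively shorter prefixes."""
    subtags = tag.split("-")
    return ["-".join(subtags[:i]) for i in range(len(subtags), 0, -1)]


def pick_translation(translations, tag):
    """Return the entry for the most specific tag available, or None."""
    for candidate in fallback_chain(tag):
        if candidate in translations:
            return translations[candidate]
    return None


# Illustrative data: a generic English entry plus an American English one.
spellings = {"en": "colour", "en-US": "color"}

print(fallback_chain("en-US"))               # ['en-US', 'en']
print(pick_translation(spellings, "en-US"))  # color
print(pick_translation(spellings, "en-GB"))  # colour
```

A tag without its own entry simply falls back to the plain language code, so the more specific variants only need entries where they differ.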

In the meantime we now support localisation for these linguistic entities.

Thanks,
GerardM

Sunday, December 02, 2007

Sinterklaas present

In the Netherlands we traditionally do not get presents at Christmas. For us Sinterklaas is celebrated on the fifth or sixth of December. In my family nobody believes in Sinterklaas any more, and consequently we can celebrate it at a more convenient moment, such as a weekend.

I have had a wonderful Sinterklaas, and I do want to tell you about the present that Kipcool and Kim gave me. Kipcool wrote this wonderful functionality that shows you what classes we have in OmegaWiki and how many translations we have in your language.

With the new functionality you will see the concepts that are translated in your language. When you check out a concept, you will even find what attributes are available in your language. I have found it to be really addictive. :)

Thanks,
GerardM

Monday, November 19, 2007

New upload functionality

OmegaWiki is really happy to announce that we have, with thanks to the Otto-Friedrich University of Bamberg, for the first time used new upload functionality. The University of Bamberg has a need for a repository for its Destinazione Italia terminology and has found this in OmegaWiki.

We have uploaded translations in Persian and we have them nicely attributed to the person who did the work. We will upload for several more languages. What is of relevance is that we can make an export for any language, for any collection. So if you are interested in helping with our OLPC collection.. just drop me a line..

Thanks,
GerardM

Friday, October 19, 2007

Money

All projects need money to operate. OmegaWiki does need money to operate. We have now made it possible for you to support what we do in a practical way; we now have a link in our sidebar so that you can use PayPal to donate money. You can just give us money, or you can help us fund a project that we want to do.. Check out our Donations, putting your money where your mouth is ..

Stichting Open Progress is a Dutch "not for profit" organisation and we can use all the money we can get to do all the cool development work we would like to do..

Thanks,
GerardM

Tuesday, October 16, 2007

Zimbabwe was formerly known as Rhodesia

Many countries have over time changed their nature. What they typically do is stay more or less in the same shape. As a consequence of war, the shapes do change. The change is often reflected in the name. The country that is now called Zimbabwe was once called Rhodesia. It is relevant information and can be expressed using relations.

In OmegaWiki the relation type "was formerly known as" has been introduced to express this relation for countries. It demonstrates that OmegaWiki is not strictly a dictionary, although it also serves the functions of one. By adding different types of attributes to classes, we provide more worthwhile information.

Most concepts are related to other concepts and when these relations become visible, a net develops of related information. This does not make OmegaWiki an encyclopaedia, it is what an ontology does. An encyclopaedia we are not; we refer to Wikipedia.. :)

Thanks,
GerardM

Friday, October 12, 2007

Linking to Wikipedia

OmegaWiki may be a lot, but it is not encyclopaedic. We do not want to be; Wikipedia does a great job at it and when it needs competition there are plenty of pretenders to its throne. So we do not compete.

When people need information, OmegaWiki will not provide all information. What it can do is link to other sources of information and Wikipedia is the obvious and the only choice for encyclopaedic information. It is the only choice because it aims to be multi-lingual and it is an obvious choice because of the shared values.

At this stage, linking to the Wikipedia articles is done by hand so initially there will be few links. We hope to harvest these links from Wikipedia and insert them with a bot. In this way we will provide an encyclopaedic service without being encyclopaedic :)

Thanks,
GerardM

Friday, October 05, 2007

Antonym

An antonym is the complete opposite of something.. black and white are probably the best known examples. The great thing is that antonym is the first global relation type and in the way it is set up, the antonym is true on a concept level. This means that it does not allow for cultural differences in the appreciation of such a relation.

I wonder how many antonyms will prove to be problematic because of cultural differences. The good news is that we are now able to have global relation types in OmegaWiki. We will have to be REALLY careful what relation types we will include. The "is a" relation is not going to be part of it because that is what makes something a class member.

With the global and the class based relation types we only need the collection based relation types to get our full functionality :)

Thanks,
GerardM

Tuesday, October 02, 2007

More localisation

OmegaWiki had a lot of new functionality go live. Today I spent time on one aspect that is really dear to me: localisation. Much of the OmegaWiki content is localised by adding translations. With the latest software release much of the more programmatic parts are in the system messages.

I started to translate the Dutch messages, Tosca caught on and started on the German messages, and Malafaya did the Portuguese. We hope that in this way our data becomes even more accessible to our users :)

One issue remains, with our messages translated in OmegaWiki, how do we get them in MediaWiki proper ...

Thanks,
GerardM

Getting to grips with the new functionality

At OmegaWiki, a lot of new functionality has gone on line. This functionality is a mix of functionality that was needed for Wikiproteins and things we have been working towards for a long time. With the changes some functionality does not work as it used to. This is a good thing.

Our first content was the GEMET thesaurus; in this collection particular relation types were used. These relation types were available everywhere and consequently we have been reluctant to add more relation types. Now relation types are associated with "classes" and we can make a DefinedMeaning a member of a class. Nederland now has a capital, a motto, a national anthem and entities bordering the country. For Nederlands it is now known what script it is written in, and in what countries it is spoken. And with the "incoming relations" we know where there is a reference to the DefinedMeaning.

Many of the existing relations will be changed from the GEMET relation types to the new relation types. The work that was done in the past is a huge benefit as it helps a lot in identifying what needs doing. With the new functionality it makes sense to add the annotations straight away; we now know that they can be done properly.

Thanks,
GerardM

Friday, September 28, 2007

Major update for OmegaWiki

OmegaWiki has had a major update: the version of MySQL that is installed has been updated, several tables have been changed to InnoDB and a lot of functionality has changed behind the scenes.

One of the effects is that the performance has improved noticeably; that was really needed. The difference in performance is a relief. It is fun again to work on the data.

One difference is the way relations work: relations are now associated with a "class" and this class defines what relation types are possible. We have added a few classes so far; "linguistic entity" is one. The associated relation types allow us to indicate where the linguistic entity fits in and where it is spoken. There will be many more classes and relation types; the quality of the classes and relation types will make a difference to the quality and the usefulness of our data.
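The following is a minimal sketch of the principle that a class defines which relation types are possible. The class name "linguistic entity" comes from the text above; the relation type names, the "country" class and the Python structure are assumptions for illustration and do not reflect the actual OmegaWiki schema.

```python
# Hypothetical sketch: relation types are only valid for members of the right class.
CLASS_RELATION_TYPES = {
    "linguistic entity": {"is part of", "is spoken in"},  # assumed relation names
    "country": {"capital", "was formerly known as"},      # assumed class and names
}


class DefinedMeaning:
    def __init__(self, name, classes=()):
        self.name = name
        self.classes = set(classes)
        self.relations = []

    def allowed_relation_types(self):
        """Collect the relation types of every class this concept belongs to."""
        allowed = set()
        for class_name in self.classes:
            allowed |= CLASS_RELATION_TYPES.get(class_name, set())
        return allowed

    def add_relation(self, relation_type, target):
        """Store a relation, rejecting types that no class of this concept allows."""
        if relation_type not in self.allowed_relation_types():
            raise ValueError(f"{relation_type!r} is not possible for {self.name}")
        self.relations.append((relation_type, target))


nederlands = DefinedMeaning("Nederlands", classes=["linguistic entity"])
nederlands.add_relation("is spoken in", "Nederland")
print(nederlands.relations)  # [('is spoken in', 'Nederland')]
```

Trying to give Nederlands a "capital" raises an error, which is the point: a relation type only becomes available once a concept is a member of the class that defines it.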

Thanks,
GerardM

Friday, September 07, 2007

Demo Semantic Support on a new URL

At Wikimania I presented what we are doing to bring real time Semantic Support to Wikipedia. The URL in my presentation is no longer valid; the new location is wikipedia.wikitestsite.org.

You will find a dump of the English Wikipedia in which many of the expressions that we already know are shown in green. We are working towards a situation where new concepts defined in OmegaWiki will be recognised in the future data mining of the same article.

What we are discussing at the moment is adding functionality to the concepts found. Some are obvious like giving the definition, giving an option to go to OmegaWiki when the definition does not fit, showing translations for the expression in the language that is of interest to the reader.

We can imagine that there is more functionality that you would consider useful. Please let us know .. :)

Thanks,
GerardM

Monday, September 03, 2007

Connecting data from different databases

In OmegaWiki there are different datasets. These represent different origins and have a different emphasis. What we are working on is connecting the data in these different datasets. Currently over four percent of our Community data is connected to data of the UMLS.

These connections are not without problems. The UMLS does not have the same (lexical) outlook; it is quite happy to have a singular and a plural to be part of the same concept. In OmegaWiki we do not support the notion of plurals yet. For the UMLS it is not a problem to include Geologists as it is included as a subject heading. We have it connected to geologist.

Lyme disease has several synonyms that are problematic from a lexical point of view; only "Lyme borreliosis" is what I expect to find in a dictionary. This does not necessarily mean that "Borreliosis, Lyme" is not useful to have. The Community database knows some 15 translations and thereby adds value to the English only content for Lyme disease.

With four percent of the Community Database connected, in reality we haven't scratched the surface of the UMLS. The UMLS is a well explored resource and I am sure that there are many resources that have made connections already. I hope we will find the people, the organisations willing to share the work that they have already done.

Thanks,
GerardM

Monday, August 27, 2007

Some more on Wolof

OmegaWiki wants to support all words of all languages, and it does not want to go into the issue of whether a language exists or not. We make use of the ISO 639 standards and, when we feel like being adventurous, we look at what is recognised in the IANA language tags.

Deferring to standards organisations means that you take what they say as the "truth". It does not mean that we necessarily agree, but it saves us from a lot of mayhem. Yesterday I wrote about the first native Wolof speaker for OmegaWiki. Today Ibou changed the definition for Wolof and included Gambia as a country where Wolof is spoken. According to the description by Ethnologue of the Wolof language this is not the case. They do refer to another language, Gambian Wolof; this description makes it clear that Wolof is spoken in the Gambia as well.

The article on Wikipedia on Wolof is in my opinion wrong; it gives the impression that the ISO-639-1 and the ISO-639-2 codes are split into two. This is contrary to how standards work. When a language is split into two, the original meaning will stand as it is, it will get a new description to indicate that it has been split and two new codes will be created.

So Ethnologue is inconsistent. Ibou is probably right. I have sent an e-mail to Ethnologue and I hope that they will amend their fine resource so that we will know for sure that he is right. :)

Thanks,
GerardM

Sunday, August 26, 2007

One new user

Sometimes a new user is special. To me Ibou is special. He is the first native Wolof speaker on OmegaWiki. He is also the first person of whom I have been told that communicating in English will be difficult.

I could not be more happy with what he has done so far; he created the Babel templates for Wolof. He has translated the first part of the main menu. Really, he makes the next Wolof speakers feel welcome..

Thanks,
GerardM

Friday, August 24, 2007

Localisation of OmegaWiki

What makes OmegaWiki so special is that the data is presented in the language selected in the user preferences. The data is sorted properly. It is really nice.

In the next version of the software, it will be possible to localise the headers as well. These are currently in English only. In contrast to how the localisation is done, the headers will be system messages. In this way each Wiki for Professionals can choose the headers that provide the best fit for its application.

Thanks,
GerardM

Sunday, August 12, 2007

30.000 Expressions for English in the Community Database

Southern Sierra Miwok is a language particular to California. In 1994 there were still 7 people who spoke this language. According to the UMLS, the name of the language is Meewoc and as Ethnologue provides it as one of the alternate names, it was possible to link this DefinedMeaning in the Community Database with the concept in the UMLS.

With 30.000 Expressions, there are many that also exist in the UMLS Authoritative Database. The UMLS has some 1.93 million Expressions at the moment, and the first 200+ DefinedMeanings have already been linked. They are mainly US-American languages and chemical elements with an occasional animal like guinea pig thrown in for good measure.

It is relevant to have the concepts linked. It means that the information in one database can be seen as supplementary to what is available in another database. Currently we have three databases, but when you consider how they are structured, there are implicit connections, as many of the concepts known in the UMLS are also known in the Swiss-Prot database. The only thing left to do is make them explicit. :)
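Making such implicit connections explicit amounts to composing two sets of links. The sketch below assumes each set of links is a simple one-to-one mapping; every identifier in it is made up for illustration and does not correspond to a real record in any of the databases.

```python
# Hypothetical sketch: derive Community -> Swiss-Prot links via the UMLS.
def compose_links(a_to_b, b_to_c):
    """Return the a -> c links implied by chaining a -> b and b -> c."""
    return {a: b_to_c[b] for a, b in a_to_b.items() if b in b_to_c}


# Made-up identifiers, for illustration only.
community_to_umls = {"DM-1": "UMLS-A", "DM-2": "UMLS-B"}
umls_to_swissprot = {"UMLS-B": "SP-9"}

print(compose_links(community_to_umls, umls_to_swissprot))  # {'DM-2': 'SP-9'}
```

Only concepts that are linked on both sides produce a new explicit link; "DM-1" stays linked to the UMLS alone.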

Thanks,
GerardM

Monday, July 30, 2007

UMLS

The UMLS, or Unified Medical Language System, contains tools, a semantic network and a specialist lexicon. It is also a collection of many resources. These resources all have their own license and copyright. Effectively much of the UMLS can be used for many purposes because the particular license allows it. In the same way, there is much of the UMLS that can only be used when the copyright holder gives permission.

In OmegaWiki, we have our first Authoritative Database online. It is the UMLS and we are proud of it. Now as the UMLS is this collection of connected resources, we present it in the same way. There is one UMLS as an Authoritative Database and it has collections that are the parts that make up the UMLS as we have it. The important thing of the UMLS is that it did make the connection between the different databases and we do use their system to connect.

What makes the inclusion of the UMLS so special is that we have the cooperation of the NLM. It is what makes this such an exciting experiment.

Thanks,
GerardM

Sunday, July 29, 2007

IL7R-alpha and IL2R-alpha anyone ?

The OmegaWiki database is blocked for editing at the moment. The reason given is: "Importing new data". For me this is great news. It means that we are importing the data that we have been preparing for a long time. It means that we are closer to getting the first Wiki for Professionals live.

Today, on the BBC News website there is an article where IL7R-alpha and IL2R-alpha play a major role. They are proteins; more specifically, they are genetic variants of proteins that play a role in the expression of multiple sclerosis.

OmegaWiki will contain terminology like IL7R-alpha. It is specialised terminology, but as it can be found in sources like the BBC website, it is good to have it. Wikiproteins will be the first Wiki for Professionals, and it will allow for the further annotation of these proteins by people who know about these substances.

It is really thrilling to see all the development needed to come to a first public outing come to a close. I congratulate the members of the consortium that make Wikiproteins possible. I believe that this has the potential to become an important tool for scientists. Wikiproteins is possible because of the many people who believe that Open Access is essential to science.

By going live, we invite comments. These will help us to make sure that the functionality is just right. The people who can best help us identify what more needs to be done are the people who will become part of what will be the Wikiproteins community.

The "official" announcement of Wikiproteins going live is scheduled for Wikimania 2007 :)

Thanks,
GerardM

Saturday, July 28, 2007

DMM - Swahili content for OmegaWiki

Yesterday, I reintroduced the notion of "Donations, putting your money where your mouth is" on this blog. Today I want to tell you about one of the first such projects.

The Kamusi project is a really important project to create a dictionary for Swahili. The project was a project of Yale University and Martin Benjamin was its editor. The project is probably one of the most important resources for the Swahili language and it is therefore really sad that the activity of this project came to an end because of a lack of funding.

Martin is preparing a new project for African languages called PALDO or the Pan-African Living Online Dictionary. This project aims to create content in many of the important African languages. Martin has been given permission to use the content of the Kamusi project from Yale. This means that it becomes possible for him to collaborate with other projects as well.

It is with pride and gratitude that I can say that PALDO and OmegaWiki are going to work together. This means that we need to get the content of Kamusi analysed and imported. It also means that we have to analyse and build the functionality so that we can give back to the PALDO project. More information can be found here.

With a 70.000 word Swahili dictionary, we have sufficient data for the first two OLPC dictionaries that will amount to something. They will be Swahili and English.. the English content comes with Kamusi as well :)

So you can help; you can develop, you can edit and you can sponsor this project.

Thanks,
GerardM

Friday, July 27, 2007

Donations, putting your money where your mouth is

When you want to get things done, you can do it yourself or you can get someone else to do it for you. Within many Open Source or Open Content projects you can donate your programming, your content and your money. When you are a programmer or an editor, you can choose what to develop, what to edit, because you are a volunteer. Nobody can tell you what to do. When you volunteer to give money, there is no such luck. You can give, you may be thanked, and that is it.

Unless of course you are a big time donor. When you give a sufficient amount of money and the purpose for this money fits within the aims of the organisation you give it to, you can determine what the money is spent on. This is not an option for small time donors.

Many small projects have been identified that need doing, projects that do not get done because they do not have priority or because nobody volunteers to do them. For such projects a specification can be made and a cost estimate can be given. These can be published and donations for these projects can be solicited. When enough people have contributed funding for a project, it can be executed.

For the complete policy read: Donations, putting your money where your mouth is.

Thanks,
GerardM

Thursday, July 26, 2007

What language is this text ?

When you write a text, you know your audience and you select a language accordingly. Given that English is the lingua franca of this day and age, and given that my public is international I do write in English. However, there is nothing that stops me or any of the other people who contribute to this blog from writing in a different language.

This is a bad thing. It would be so much better if I were able to actively indicate the language that I am writing in. Obviously, Blogger can have its own routines to distinguish certain languages, but I am absolutely certain that they will not recognise the majority of languages.

While I am typing this blog, I have indicated to my spell checker that I am using UK English spelling. This means that many of the mistakes I make will not be seen by you. Having indicated that the language IS UK English, it would have been great if it were picked up by the Blogger software.

Consider, when I inform Blogger that I am writing UK English, my Firefox spelling extension does not need to guess anymore. It would provide me with a much better functionality and it would make functionality possible in languages that are not well supported..

So please blogger.com, please allow me to tag the language of my texts.

Thanks,
GerardM

Wednesday, July 25, 2007

A new menu

In preparation for the presentations at Wikimania, we are making OmegaWiki extra nice. I am finishing the new main page where you will find information about the things we have been preparing that are not there yet.

There have been things in our wiki that do not have any functionality yet. A lot of work has been done in the last weeks in refactoring our code. We are making changes to the code in order to make it easier for new developers to get to grips with the code. These changes will make it possible to build some of the more complex functionality that we need.

While we are working hard at OmegaWiki, Knewco is working hard preparing their Desktop, with its Knowlet and Semantic Support. It is really cool that OmegaWiki will not only be useful in its own right, but that applications are going to be built on top of it.

I am really excited about going to Taipei. I will be happy to talk about and demonstrate what we are on about.. It is less than a week away and I am thrilled to see everything coming together :)

Thanks,
GerardM

Tuesday, July 10, 2007

Ch'orti', a language spoken in Guatemala and Honduras

Ch'orti' has caa as its ISO 639-3 code. Some 30.000 people speak the language and, according to Reeck, many more belong to the associated ethnic population.

I have been adding languages to the ISO 639-3 collection for some time now, I started with Ghotuo (aaa) and I have now progressed to Ch'orti' (caa). Many have few speakers, many are extinct, several are sign languages and almost all of them I have already forgotten.

So why do this ? Is there method to this madness ? OmegaWiki aims to include all words of all languages, but what languages are there ? Do we want to discuss the notion of yet another linguistic entity that we should support ? Does something like Brithenig (bzt) deserve its place under the sun ?

I do not mind the discussion, but I do mind what the result will be of such a discussion. It needs to come to a conclusion and I do not want to be in the position that people look to me for a verdict. It is not a good idea either to have the OmegaWiki commission be in that position. It is for all these reasons that we decided on adopting standards and started with the creation of portals for the ISO 639-3 languages. We are now at the next phase, creating the DefinedMeanings for these languages and making them part of the ISO 639-3 collection.

This is only what is recognised by one standard, there are other standards that help indicate what the precise linguistic entity is that is to be documented in OmegaWiki. First we should finish this, there are currently 1365 entries in the ISO 639-3 collection .. there are many more thousands to go :)

Thanks,
GerardM

Wednesday, July 04, 2007

Aklanon ...

Aklanon is a language spoken in the Philippines. The ISO-639-3 code is "akl" and according to a 1990 census some 394.545 people speak this language.

On OmegaWiki, Aklanon has its own portal and I was really thrilled when Chief Mike indicated his interest in working on the Aklanon content. We do want Aklanon but we also have our own standards. One of these standards is that the Babel templates for a language are in that language. I really appreciate the notion that the Babel templates have to be understood; however, the Babel templates are one of the first things that we hope to get in any language.

When we have the Aklanon Babel templates in Aklanon, it will be a privilege to have Aklanon as the next language that we support in OmegaWiki.

Thanks,
GerardM

Saturday, June 23, 2007

Assorted statistics

OmegaWiki has reached 30.000 DefinedMeanings, and we have some 258.000 Expressions. As some people do not stop telling me, there are about 28.000 Expressions in the largest language, which means that there is a close relation between the number of Expressions in a language and the number of concepts. This is said to indicate that OmegaWiki should be able to scale. :)

The Webalizer statistics have given me a surprise; there is now more info to be found. What is nice to see is that there is now a breakdown of where the traffic comes from. As we have a lot of traffic from crawlers, it would be good to exclude crawlers in order to see where interested PEOPLE come from. Erik told me that the new features are probably due to this week's upgrade.

Malafaya is now the fourth person who has taken an interest in our statistics. He has worked on the reliability of the statistics of collections. His first effort improved the numbers, his second stab at it improved the performance of the queries a lot.

Finally the Alexa statistics have improved a lot for no apparent reason. We have had times when we were not ranked at all or we could be found above the 800.000 range. Now we have been hovering around the 368.500 mark for a few days. Still not impressive but it looks much better. When you compare the Alexa numbers with our Webalizer numbers, the only thing that can be said is that for Alexa the numbers are statistically not really valid. This will improve as our community grows.

Thanks,
GerardM

Monday, June 18, 2007

Server upgraded; dataset support online

The OmegaWiki.org server has been upgraded to Debian etch. This gives us PHP 5.2.0, which is needed to run the latest version of OmegaWiki. (In the process, we exchanged our hand-compiled PHP and Apache binaries with distribution packages.) OmegaWiki itself has also been upgraded. The current version of the code has support for so-called "data-sets".

A data-set is essentially an instance of OmegaWiki which can contain a completely separate set of DefinedMeanings and associated data. This is useful for importing authoritative sources which may either not yet be fully editable, or which are meant to be retained alongside an editable version. It also allows us to showcase imported databases, to convince organizations that own the data to release it freely and make it fully editable.

The current version already supports mapping DefinedMeanings across data-sets. So you can indicate that concept A in data-set 1 is the same as concept B in data-set 2. However, it does not yet support copying data from one data-set to another, which is what we are working on right now (some hints to it are already in the code).
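To make the idea of mapping concrete, here is a minimal sketch in Python. It is not how the OmegaWiki code does it; the class name `MeaningMap`, the method names and the data-set labels are all made up for illustration. It only shows that "concept A in data-set 1 is the same as concept B in data-set 2" can be kept as a symmetric link between (data-set, DefinedMeaning) pairs.

```python
from collections import defaultdict


class MeaningMap:
    """Hypothetical store of 'same concept' links across data-sets."""

    def __init__(self):
        # Each key is a (data_set, defined_meaning_id) pair.
        self._links = defaultdict(set)

    def map_same(self, a, b):
        """Record that a and b describe the same concept (both directions)."""
        self._links[a].add(b)
        self._links[b].add(a)

    def equivalents(self, key):
        """Return every concept linked to key, directly or indirectly."""
        seen, todo = {key}, [key]
        while todo:
            current = todo.pop()
            for other in self._links.get(current, ()):
                if other not in seen:
                    seen.add(other)
                    todo.append(other)
        return seen - {key}


mapping = MeaningMap()
mapping.map_same(("data-set 1", "A"), ("data-set 2", "B"))
print(mapping.equivalents(("data-set 1", "A")))
```

Note that such a sketch only records that two concepts are the same; copying data from one data-set to another is a separate step.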

Currently OmegaWiki has a single data-set only. We are considering setting up some example data-sets to let the user community play with this new functionality.

A word of the day

Like so many other resources that are lexical in nature, OmegaWiki has a word of the day. Our word of the day is not prepared in advance and we leave it to the community to create one. I am always relieved when there is actually a word of the day when I wake up.

Today's word of the day is interesting for many reasons. The word is wheat. There are several issues to consider.
  • It is marked as "English (United States)". There is however no "English (United Kingdom)" and as I cannot find this alternate, it should be just "English".
  • The definition has not been translated into English. This is very much optional, but it makes it so much easier to translate the definition into yet another language.
  • In the definition, wheat is said to be part of the family "Graminacee" of the genus "Triticum". According to Wikipedia the family should be "Poaceae".
The big thing here is that taxonomy, while a science, is not exact. There is no such thing as a name that will be true forever. With some regularity it is found that names need to be changed. These revisions may mean that species that are known to the public are no longer considered to be that species; they can be split up or lumped together.

The issue here is that without being able to refer to both names for the grass family, it is hard to appreciate this definition. This word of the day clearly shows why there is a need for a dictionary of life, a dictionary that explains all these names and shows the relations between the different validly published taxonomical names.

Thanks,
GerardM

Sunday, June 17, 2007

OmegaWiki only a translation dictionary ?

There is some misinformation about OmegaWiki; it is said for instance that OmegaWiki is only a translation dictionary. There are also people who do not consider OmegaWiki as relevant because it is not a Wikimedia Foundation project.

It is for the people that have not looked at OmegaWiki for a long time or have not really looked well that we want to state the obvious: OmegaWiki is not only, but also, a translation dictionary. When you look at the number of expressions per language, you will find that we have almost 30.000 DefinedMeanings. The reason why we have 11.000 more English Expressions than we have for any other language is that we have collections that are still mostly English. Collections like the ISO-DIS-639-6 are relevant because of the information that is included in the data.

OmegaWiki is becoming relevant because our data is starting to be used outside our project as well. Positano News uses OmegaWiki data for "assisted reading", this helps people to understand terminology that is in an Italian news article. It does give you definitions and translations.

It may be that the current possibilities at OmegaWiki are not immediately obvious; there are many DefinedMeanings that do not have any annotation. An annotation can identify the part of speech for a word, it can provide you with a sample sentence or how to hyphenate a word. We want to include links to other websites; we want to link to Wikipedia articles in order to make it convenient to our users to find good encyclopaedic information.

OmegaWiki is not feature complete. We want to add many more features, but our first priority is to make sure that it works well and that the features that matter most are included. We need to improve on our performance and we need to make sure that we provide a framework that facilitates collaboration with other organisations.

The Wikimedia Foundation is one organisation that we really want to collaborate with. On a personal level we have been involved and we want to extend this by collaborating on an organisational level as well. This often repeated intention may be one reason why certain people are so apprehensive about OmegaWiki; we wanted it to be a WMF project, it is not a WMF project but we still see room for doing good together.

Thanks,
GerardM

Saturday, June 16, 2007

If you love somebody set them free

On OmegaWiki we have many sysops. Giving people the abilities that come with the sysop flag is what has prevented a lot of vandalism and spam. We are happy and grateful that this has worked out so well for us. As a consequence, we do not have the eternal admins versus the editors controversy; our admins do not have to do anything, they are kindly requested to do good and amazingly they do.

With some sadness, we learned that a Wiktionary admin is leaving Wiktionary; he was told to be more active or else. There is a silver lining in that this guy announced that he will become more active on OmegaWiki. Obviously every project makes its bed and lies in it. We have chosen to have as little bureaucracy as possible. The question is very much: how is it going to scale?

OmegaWiki will expand by including "Wikis for Professionals". Each will include the terminology for a specific domain extended with specific information and functionality. With more people signing up to such a community, it may acquire its own rules. These rules should fit in the larger community that is OmegaWiki. What I expect is that often the unwritten rules will be the more important ones. In a Wiki for Professionals, people will be interested when the project is relevant. When this proves to be demonstrably so, it may become important to be identifiable to gain the benefits of the association with the project. The flip side of the coin is that negative behaviour can damage a professional reputation.

In a year, the community of OmegaWiki will be different. We work hard to provide it with an environment that will enable it to do good. At this stage it is still very much basic functionality that we are building. There is much new functionality and data waiting to go live. When it has, we will love to hear what is good and what could be better. We will love it when people help us morph our functionality and make our environment more relevant.

The only thing that we will insist on is that things can coexist and people collaborate, in that way we set not only the data free but also the imagination free, we will love it and we will set them free.

Thanks,
GerardM

Sunday, June 03, 2007

250.000 expressions

Today we reached the milestone of 250.000 Expressions at OmegaWiki. It is special because most of this data has been entered by hand. We find that when people get enthused by the concept of OmegaWiki, they do make a difference for the language that they champion.

We have people who have a particular interest in Georgian, Khmer and Spanish, it shows in the statistics as these languages grow much faster than the others.

Aveyron is the 250.000th entry in OmegaWiki and it is only fitting that Ascánder was the person adding it. Ascánder is one of the most valuable contributors to OmegaWiki. Aveyron is part of a project to include information from the ISO-3166-2. In this standard it is detailed in what way countries are subdivided. It does not state that Italy has provinces, the USA has states or that Germany has Bundesländer. It does give the names of these entities.

So, OmegaWiki is evolving nicely. We hope that in line with how Wikis evolve, we will have an easier time to get 250.000 more Expressions.

Thanks,
GerardM

Friday, May 25, 2007

After a week of hacking, testing !!

A lot of work has been done on the OmegaWiki functionality. We have been working on functionality that is of importance to the organisations that we hope to collaborate with.

There were several issues that we have dealt with:
  • Support multiple "data-sets" within a single OmegaWiki installation. These sets can be used to store imported "authoritative databases," such as scientific databases.
  • Users can navigate within a data-set or choose a different one to look at. The default set can be configured globally, for a user group, or for an individual user.
  • Different data-sets can have different permission levels.
  • DefinedMeanings in different data-sets that are identical (describing the same concept) can be mapped to each other.
  • When data is imported, we can choose which data-set to import it into.
There are several parts of the puzzle that are still missing; we are however at a stage where we need to test our data. So we are going to make this functionality go live soon. The first thing is to know that after all the database changes and much refactored functionality everything still works.
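As an illustration of how a default data-set could be picked, here is a small Python sketch. The function name, its parameters and the order of precedence (individual user first, then user group, then the global setting) are my assumptions; the actual code may well decide differently.

```python
def resolve_default_data_set(user, groups, user_defaults, group_defaults, global_default):
    """Pick the default data-set for a user (hypothetical precedence).

    Assumed order: a setting for the individual user wins, then the first
    of the user's groups that has a setting, then the global default.
    """
    if user in user_defaults:
        return user_defaults[user]
    for group in groups:
        if group in group_defaults:
            return group_defaults[group]
    return global_default


# Made-up configuration, only to show the three levels.
print(resolve_default_data_set(
    user="some_editor",
    groups=["biologists"],
    user_defaults={},
    group_defaults={"biologists": "species-db"},
    global_default="community",
))
```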

The next thing will be to experiment with a first authoritative or additional database. The obvious first resources are the GEMET collection and the ISO-639-6 collection. This is all in preparation of more partners that will be collaborating in the OmegaWiki environment.

More functionality will be implemented in the coming weeks:
  • The possibility to add multiple values without having to reload the editor each time
  • Allowing for annotations that are dependent on previously set values; this will for the first time provide us with terminological functionality
  • More functionality is in the pipeline, I think you will love it when we have it :)
Thanks,
GerardM

PS It was a fun week, we had a day with a negative number of lines added. We had to change functionality to enable the software to run under Windows. To relax, I have read several chapters of Accelerando. It was fun to watch Kim and Erik work together, my appreciation for both grew. It was gratifying to see my dream become more of a reality :)

Sunday, May 20, 2007

Annotations, hyphenations and IPA

On OmegaWiki we annotate. In addition to the sample sentences, it is now possible to add hyphenations. A thank you to Sean Burke and Kim Bruning who made this possible. :)

It is also possible to include the International Phonetic Alphabet or IPA. On the one hand we should feel confident that people will do good. On the other hand, a lot of the IPA notations out there are not useful because they assume that the persons using it have a specific background.

In OmegaWiki we have a public that is truly multi-lingual. This is best experienced when you change the user preferences to another language. Most of the language labels may be shown in the selected language. The consequence of a multi-lingual public is that only IPA notations without language specific shortcuts are useful.

I am sure that you have an opinion about this, we hope to learn your arguments ..

Thanks,
GerardM

Monday, May 14, 2007

Domains and OmegaWiki

Some days ago Lejocelyn added a feature request about Domains on OmegaWiki. As this is one of the points that are also most relevant to me personally, of course I answered. Why are domains so relevant? Well, let's say we have 1.000.000 Expressions for English-German. For us as translators only a certain set of data is relevant when we do translations, so having all 1.000.000 Expressions to search, with all potential results in our glossary window, is overkill: instead of helping you to find the right term, it would take you triple the time you need to look things up in a dictionary (let's say about physics or medicine).

Dictionaries are general, yes, but then the amount of specialist terminology is limited to what is most often used. Therefore each of us still has these very special dictionaries about just one topic, and these are our most valuable tools besides the Internet (well yes, there are terms that are not in our dictionaries, so we have to search for them in available texts about the topic we are translating).

What I would like to say with that: domains might not be relevant to somebody searching for just one word every now and then, but they are most relevant when you want to use a resource in a professional way.

Thanks for considering to have Domains within OmegaWiki.

Friday, May 11, 2007

OmegaWiki supports many linguistic entities

OmegaWiki aims to include all words in all languages and provide both lexical, terminological and ontological information. As the discussion of what makes a language is an endless one, those languages that are included in the ISO-639 codes are the ones that are supported.

Having chosen the ISO-639-3 to start off with has proven a great start. It did however not provide the granularity needed to categorize words to their linguistic entity. How to deal with languages that are written in several scripts, how to deal with regional differences? This is what this iteration of the ISO-639 standard does not deal with.

The implication is that this standard on its own does not suffice. By combining the data with other standards with other codes it is possible to provide more granularity, but how to deal with dialects like Westfries, that is spoken in the area where I grew up?

As the OmegaWiki project was evolving and getting traction, I got into contact with Debbie Garside. She is heading Geolang, an organisation that has for a long time been preparing the next iteration of the ISO-639 standard, the ISO-639-6. The aim is to include at least 25.000 linguistic entities in a hierarchical structure. Adopting this data would allow OmegaWiki to better achieve its aim: include all words of all languages.

When a standard is published there is a prescribed period in which the public is invited to comment on it. So far this has been done using e-mail. Experience shows that when the amount of subject matter is too big, e-mail is not a tool that can cope. Geolang had explored the option of using Wiki technology before; this sadly did not lead to the right synergy. In OmegaWiki however, there was an active interest in language standards, and it not only includes the Wiki methodology, it even allows for the inclusion of the data in a true hierarchical way.

By publishing the data in a wiki, in essence everybody with an interest in orthographies and dialects is invited to comment, modify and add to the hierarchical data. To make this into a standard, there will be a need to assess the community generated data and assert the validity of the information provided. This is where the World Language Documentation Centre will play its role. As its name implies, it documents languages and it will do so in the broadest sense of the word. Obviously an organisation like this will only function well when it is as an organisation an inclusive organisation. The make-up of the current board reflects many specialities that make up linguistics and the language industry.

It is with a fair amount of satisfaction that I can announce that Sean Burke, one of the volunteers of OmegaWiki has imported the first batch of the ISO-DIS-639-6 data in time for the inaugural meeting of the World Language Documentation Centre. Both OmegaWiki and the WLDC will rely on collaboration, to get the necessary work done. Our challenge will be to provide the infrastructure and the minimal organisation to start and sustain our projects.

With the inaugural meeting, it is proclaimed to the world that as an organisation the WLDC is ready for business. With the first data available in OmegaWiki, the first request goes out to the world to collaborate on the languages that are spoken and the orthographies that are written. It is the start of acquiring the meta data that helps us understand the data that is already out there and consequently turn all this data into information, because we will become better able to parse the data.

Thanks,
GerardM

Sunday, May 06, 2007

New functionality

OmegaWiki has collections. These collections serve to indicate that certain DefinedMeanings are related. Collections can serve a purpose; the GEMET collection for instance is a resource that was the data that started our project. The OLPC collection is a list of the first words that we want in all languages to start a multilingual dictionary for the OLPC project.

In these statistics, we have a tool to tell people what projects we have within OmegaWiki. This allows people to work on things that are of interest to them. The really sweet thing is that it shows like a work in progress, it shows what needs doing and, what has already been done.

There are several projects that are dear to me and can use more attention:
When you do not see your language in a collection, just add one word to any of the DefinedMeanings that are part of the collection and the next time it will be there. When your language is not supported in OmegaWiki, let me know and I will see how to remedy this.

Thanks,
GerardM

Friday, April 27, 2007

A similar milestone of a different kind

The last post of the OmegaWiki blog was about statistics. Kipcool indicated that there was an issue with the numbers: the query did not take the deleted Expressions into account.

Today we have the opportunity to celebrate anew one of the milestones that went before. Today we have a cool 10.000 expressions in Italian. :)

I want to thank Kipcool, Kim Bruning and Zdenek Broz.
GerardM

Monday, April 16, 2007

Milestone

The nice thing of statistics is that when there is a milestone, it can be celebrated. It is with pleasure that I can announce that OmegaWiki now has one language with 15.000 words. It is German that has currently the most expressions. There are currently 213.550 Expressions in 133 languages.

For OmegaWiki it demonstrates that we have a nice autonomous growth. Of the languages that we started after the import of the GEMET data, Japanese currently has the highest number with almost 4.000 words. Esperanto is with some 1.500 the biggest artificial language.

You can help us with translations for the language names that we have translations in. This will improve the usability of our data when the user interface is selected in the language of people's mother tongue.

OmegaWiki is still in need of more functionality. This is something that we try to achieve in any way possible. It is however rewarding to notice what difference the existing functionality makes.

I am very happy with what we have achieved so far. It bodes well for the future :)

Thanks,
GerardM

Monday, April 09, 2007

Fall back languages

OmegaWiki has a relevant bug fix; it is now obvious what option you add. When you add a part of speech you will see "part of speech" in your own language and when we do not have it, you will get it in English as the fall back language. This is a huge improvement from having the same text in some 24 other languages.

The next improvement will become possible when the Multilingual MediaWiki is finished; this will allow you to select the languages you are interested in. These languages are more personal than just falling back to English.

Another option would be to identify a fall back language for a language itself. This makes sense for languages that exist in a space where another language is well known and better supported in MediaWiki. Languages like French, German, Italian, Spanish, Portuguese and Mandarin come to mind.. I am sure there are more..
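The idea of a fall back language for a language itself can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name `localise`, the layout of the dictionaries and the example pairing of Occitan with French are mine, not something that has been decided for MediaWiki or the Incubator.

```python
def localise(message, language, translations, fallbacks, last_resort="eng"):
    """Return the message in the language, a fall back language, or English.

    translations: {language_code: {message_key: text}}
    fallbacks:    {language_code: fall_back_language_code}  (hypothetical data)
    """
    tried = set()
    current = language
    # Walk the chain of fall back languages; 'tried' guards against loops.
    while current is not None and current not in tried:
        tried.add(current)
        text = translations.get(current, {}).get(message)
        if text is not None:
            return text
        current = fallbacks.get(current)
    # Nothing found in the chain: fall back to English, else show the key.
    return translations.get(last_resort, {}).get(message, message)


translations = {
    "eng": {"part of speech": "part of speech"},
    "fra": {"part of speech": "partie du discours"},
}
fallbacks = {"oci": "fra"}  # illustrative choice only

print(localise("part of speech", "oci", translations, fallbacks))
```

A language without its own fall back entry simply ends up with English, which is the behaviour we have today.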

For the Incubator we want to define fall back languages to make it easier to localise the MediaWiki messages.. We hope to get answers what the best choice is.. otherwise we will have to guess..

Thanks,
GerardM

Friday, March 30, 2007

Protecting people from poisonous personalities

The BBC website has an article about the culture of abuse on line that allows people to be noxious, abusive, threatening to people and in this instance to women. The thing that sparked this off are death threats to a prominent blogger, Ms Kathy Sierra. I read a follow up article on oReillynet called "Open season on women".

We have been extremely lucky and fortunate on OmegaWiki so far with abuse and vandalism. We have also been extremely lucky with our community. I do not exaggerate when I say that we have some great females being extremely relevant to what we do. If both articles need an official reaction, it would be that we will not tolerate such things on OmegaWiki and, that I am still appalled by the way the Wikichix were driven away from the Wikimedia Foundation for all the "right" reasons.

At this stage in the life cycle of OmegaWiki, I can say that people who are deemed to be poisonous in what they do will find that there is little toleration for them. When we get to the stage where it becomes an issue if we should tolerate poisonous people, I can tell you now that I am prepared to fight tooth and nail to get such people out.

Thanks,
GerardM

Saturday, March 24, 2007

More new functionality & some stats

Lies, damned lies and statistics.. I always thought that OmegaWiki had more English words than German words. This turns out not to be the case, then I thought it may be because we make something English (United Kingdom) or English (United States) when the word is not shared between the two.. There are only 326 UK words; that does not fill the 441 gap.

These and more pondering are possible because of our new statistics functionality we have courtesy of Zdenek Broz. It shows you the actual numbers of Expressions in OmegaWiki.

Some more factoids: we now support 130 languages, 28 of these have less than 10 words. When you compare us to Wiktionary, we would be the fourth in size when an article is considered the equivalent of an Expression, or we would be the twenty-fifth in size when a DefinedMeaning is considered in this way. When you consider that all of our growth so far has been autonomous, I think we are doing well.

In the mean time, Alexa considers us the 480,601 website without statistics for the week.. I do not know what goes on there. I do like their T-shirts though.

To quote the Alexa T-shirts: "Will dance for better Alexa rank " :)

Thanks,
GerardM

Thursday, March 22, 2007

New functionality

OmegaWiki has new functionality; there are two bits of functionality. One bit is not obvious but really helpful: a function to (re)build the indexes of the database. What it does is ensure that the indexes are built in the right order. This improves performance considerably.

The other bit is much more spectacular. It sorts the tables in the HTML based on the content of the first column. The order depends very much on the language selected in the user preferences. This leads me to my question: is it sorted well when your language is Persian, Arabic or Hebrew? These are right to left languages and I have no clue if the sorting is done well.

Thanks,
GerardM

Monday, March 19, 2007

How do the four freedoms apply to one database?

The FSF defines four freedoms when it comes to software. What kind of freedoms apply to a database like OmegaWiki?

The OmegaWiki software is licensed under the GPL, like MediaWiki, which it is an integral part of. This means that you can use the software as is. As OmegaWiki uses specific data structures that can be licensed separately, for completeness' sake the database design is also available under the GPL.

The data that is contained in OmegaWiki is licensed under a combined GFDL/CC-by license. Many people insist that these licenses are not compatible. At issue is that the data are just facts; it is only possible to copyright facts as a collection. We want people to make use of our collection. For us success is: "when people find a use for our data we did not think of".

We invite people to collaborate on our data, when they enter some Babel templates on their user page, we give them edit rights to the data. We invite organisations to collaborate on our data because there is so much data that organisations can share, there is so much labour invested in the type of data OmegaWiki can be a home for.

So the data and the software are Free. How about OmegaWiki itself ..

As there is only one OmegaWiki.org, the room to do whatever is limited. The data has to be useful to everyone and it has to fit in with the notion of the DefinedMeaning. When domain specific data is added, it needs to be domain specific and, there has to be agreement that this data provides a suitable extension for people involved in this domain.

When people find that this is not enough, they can have their own database. This does not mean that they cannot cooperate. Much of what they need in terms of extra functionality will be shared. It means that even when the OmegaWiki database is forked, there is still plenty of scope to improve all the things we do agree on and collaborate on those.

There is plenty of Freedom. All the Freedoms we provide. I think however we achieve the most success when we find that there is more that binds us than that drives us apart. It means that we have to work hard in understanding what our common needs are.

Thanks,
GerardM

Saturday, March 10, 2007

Managing data with some SQL

The great thing about OmegaWiki is that the data is in a database. You might say that this is not that special, every wiki uses a database. Today we have, for the first time, done some curation on the data: everywhere where a word in en-US was written exactly the same as in English, we have deleted the English. One example is the word "competition"; in the history you will find the deletions.
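For those who wonder what such a clean-up looks like, here is a rough sketch using Python's sqlite3 module. The table `expression` and its columns are invented for this example and are much simpler than the real OmegaWiki schema. It also removes the rows outright, which is a simplification; on OmegaWiki itself the deletions can still be found in the history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression ("
    " id INTEGER PRIMARY KEY,"
    " meaning_id INTEGER,"   # the DefinedMeaning the word belongs to
    " language TEXT,"
    " spelling TEXT)"
)
conn.executemany(
    "INSERT INTO expression (meaning_id, language, spelling) VALUES (?, ?, ?)",
    [
        (1, "en", "competition"),
        (1, "en-US", "competition"),  # identical spelling: 'en' row goes
        (2, "en", "colour"),
        (2, "en-US", "color"),        # different spelling: both rows stay
    ],
)

# Delete the English row wherever the en-US row for the same meaning
# is written exactly the same.
conn.execute(
    """
    DELETE FROM expression
    WHERE language = 'en'
      AND EXISTS (
          SELECT 1 FROM expression AS us
          WHERE us.language = 'en-US'
            AND us.meaning_id = expression.meaning_id
            AND us.spelling = expression.spelling
      )
    """
)

remaining = conn.execute(
    "SELECT meaning_id, language, spelling FROM expression ORDER BY id"
).fetchall()
print(remaining)
```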

I am really grateful that Leftmost has started to use SQL to fix things for us. It saves us what is most valuable; the time of our editors.

There are other things that we can do; I have asked to have all Bulgarian words that are capitalised changed to lower case where the Russian words are lower case. This is to fix something that is done consistently this way in the GEMET database. With these improvements, the GEMET data becomes usable for other purposes; things like data mining .. :)
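The rule for the Bulgarian words is simple enough to write down as a small Python function. This is only a sketch of the rule as I described it; the function name is made up, and lowering only the first letter (rather than the whole word) is my own cautious assumption.

```python
def normalise_bulgarian(bulgarian, russian):
    """Lower-case a capitalised Bulgarian word when the Russian word for the
    same DefinedMeaning is written entirely in lower case."""
    russian_is_lower = russian == russian.lower()
    if russian_is_lower and bulgarian[:1].isupper():
        # Assumption: only the initial capital is the GEMET artefact.
        return bulgarian[:1].lower() + bulgarian[1:]
    return bulgarian


print(normalise_bulgarian("Вода", "вода"))      # the Russian word is lower case
print(normalise_bulgarian("Европа", "Европа"))  # the Russian word is capitalised
```

Words where the Russian is capitalised as well, such as proper names, are left alone.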

Thanks,
GerardM

Sunday, March 04, 2007

About reputation and education

Wikipedia has a new scandal. There are several issues here and, several interested parties. The scandal is about someone who is known by a nickname and who claimed that he had academic qualifications. He claimed to be a doctor of theology.

One of the interested parties who obviously relished this occasion was Dr Sanger of Citizendium fame. For Dr Sanger it is one of those occasions where he can sing the praises of his project. He did, I did not read anything new there. In the mean time, his project was announced in Nature and now, so many months later, there is still nothing to be seen. That does wonders for the credibility of that project ...

Given that scientific credentials will be very relevant on OmegaWiki, I have given it some thought. When people want to claim professional credentials, they would have to provide us at least with their real name and their e-mail address. These would be the requirements on the level of OmegaWiki.

For the Wikis for professionals, different rules might apply. When medical credentials are claimed, wrong information can kill. It is for such reasons that much more identifiable information is likely to be required. This does not mean that credentials are as important as Dr Sanger says. Relevant is the quality of the information provided. Relevant is the role a person plays in the community. This means that a person can become relevant in OmegaWiki by building a reputation. For this you do not need scientific credentials. Science has, like every part of society, its fair share of miserable people and I am sure Citizendium will learn that as well.

If this incident is to be a lesson, the lesson is that it does not pay to assume credentials and have a virtual reality meet real life. It is therefore sad that Essjay is now "retired". If this means that the person who did a lot of good work will do no more, then it is a sad outcome. If this is the only time that the issue of assumed credentials raises its ugly head, it has a silver lining. If this incident is sufficiently public that it will stop this phenomenon, then I will be glad.

Thanks,
GerardM

Wednesday, February 28, 2007

What is a Word?

First, let me make it perfectly clear that this is a discussion that has raged for centuries. I know full well that everybody has his or her own opinion on this matter, and that I am not going to resolve this issue today. This is an overview and a bit of personal opinion, as relates to online dictionaries.

Intuitively, we all know the answer. A word is a unit of language conveying some meaning. But how do we decide what is a real word? We look in a dictionary, of course. What do we do if we're writing a dictionary?

We are caught between cataloging what is "right" (prescriptivism) and what is actually done (descriptivism). The pendulum has lately swung towards descriptivism, and I would say that there are some good reasons for that trend. The language that is spoken on the streets is not the same language that is written in academia. Somebody learning a language may genuinely need help sorting out the less proper terms in it.

Take, for instance, colloquialisms such as "irregardless" and "humongous", and all the phrases that have gotten squished together into amalgams like "gotcha" and "woulda". Most people would readily agree that these words do not belong in a college thesis paper.

What is the scrupulous lexicographer to do? Fortunately, it is not a strict either-or question, especially in a work not substantially limited by size. In an electronic resource, we can put them in, anyway. To satisfy the formal sorts, the prescriptivists, we can then place a prominent usage note in the entry, explaining just why a writer might wish to use caution with the term: ginormous is a colloquial term, regarded by many to be something less than a proper word. Thus, the reader is both informed and cautioned.

That's fine for most of the slang and jargon, but we have another problem. People keep making up new words. My sister in law coined the term "muskaroon" to mean generically any small, furry creature that scurries past too quickly to identify. Squirrels, chipmunks, gophers, and presumably rabbits would all qualify. So we have a unit of language with a symbol and a meaning. The trouble is, if you walked up to people on the street and inquired whether there were muskaroons in the area, nobody would be able to answer who hadn't talked lately to my sister in law, and that is a small minority of people, indeed.

The test here is usage. Can we demonstrate that the word is in common use? Now, depending on the character of the dictionary, we can define the rules in various ways. Was it used by a certain number of independent sources? Did anybody important (such as Shakespeare or a prominent academic journal) publish the word?

Generally, we also try to find and present examples of the term in what is called "running text". That means that it is in a paragraph, and isn't only used as somebody's nickname, say. The edge is still a fuzzy one. Are the citations in traditional print sources, such as books and journals, or are they sprinkled in a couple of blogs and forums? Was the word used in only one limited context, or in a variety of sources and over a period of years? These sorts of tests can help to weed out many of the more questionable entries. At some point, though, it may yet come down to a judgment call, if not on whether a word is real, then on how to apply the rules. In these cases, I advise the users of a dictionary to bring a healthy dose of skepticism with them, to recall that even dictionaries are not infallible, and to trust at the very least that these decisions are made by real people who care for the project.
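To make the usage test a little more concrete, here is a minimal sketch in Python of how such rules might be written down. The `Citation` record, the function name and the thresholds (three independent authors, at least a year between the first and last citation) are all assumptions made up for this illustration; every dictionary sets its own rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """One quotation containing the word (hypothetical record layout)."""
    author: str
    year: int
    medium: str         # e.g. "book", "journal", "blog", "forum"
    running_text: bool  # True when the word appears inside a sentence or paragraph


def meets_usage_test(citations, min_sources=3, min_year_span=1):
    """Apply an illustrative usage rule; the thresholds are assumptions."""
    # Only citations in running text count (not nicknames, handles and the like).
    usable = [c for c in citations if c.running_text]
    if not usable:
        return False
    # "Independent sources" is approximated here by counting distinct authors.
    if len({c.author for c in usable}) < min_sources:
        return False
    # The word should have been in use over a period of time.
    years = [c.year for c in usable]
    return max(years) - min(years) >= min_year_span


muskaroon = [
    Citation("my sister-in-law", 2006, "conversation", True),
    Citation("my sister-in-law", 2007, "conversation", True),
]
print(meets_usage_test(muskaroon))  # one source only, so the test fails
```

A check like this only settles the easy cases. Whether a blog counts as much as a book, for instance, remains the judgment call described above.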

If, knowing all that, you find you don't like the way "they" are running the place, you are invited to do a better job.

Monday, February 19, 2007

Anyone may edit

Written in response to this project.

Somebody asked me today about what happens to a dictionary when anybody can edit it. As anybody who has ever edited a wiki knows, the openness is a mixed blessing.

It is a great thing, because many hands make light work. Dictionaries need to be every bit as large as the languages they catalog, so the process of gathering and maintaining the data is a huge one. As we start to add translations between languages, rather than simply defining a term, that task becomes orders of magnitude bigger. To capture all words in all languages is something that will take nothing less than a wiki and a worldwide community. It is a monumental task, but in a wiki, we can conceive of creating a resource on such a scale.

It is a great thing because so many regions and cultures can be represented. An American may understand most of the English spoken in South Africa or New Zealand, but both of those regions have slang all their own. Chile speaks Spanish far differently than Spain. All those variants can have their place.

It is a great thing because a wiki can evolve with a language. New terms come into use all the time, and a freely editable electronic resource is not limited in its capacity to store data or to accommodate a large, diverse set of editors.

The big trouble is this: if just anybody can edit, how on earth do we know it is right? I'd like to explore a few approaches here. Of course, any of these approaches could be considered as a barrier to entry, but these things are always trade-offs.
  1. Appoint trusted users to do the housekeeping. These are the sysops, administrators, bureaucrats, the librarians, or the janitors, depending on your point of view. These somebodies keep watch and undo the damage that some of the just-anybodies can do. If somebody writes an article containing typical vandalism, such as "asdfasdf" or "Dave is a dork!", an administrator can delete or undo it. Much vandalism is so predictable that even a bot can detect and remove it. Unfortunately, a select group of administrators, however well trusted or well-read, cannot be everywhere at once. Things get missed, even with a checklist system such as patrolled edits. Nor can an administrator or small group of administrators know everything. Misinformation, intentional or otherwise, is not so easy to spot as out-and-out nonsense.
  2. Hold people accountable. Articles have histories, so you can see who did what. Even pseudonymous users develop reputations. Anonymous users tend to attract the most scrutiny. An active, healthy wiki often develops into a meritocracy, with leaders having sway (though not necessarily authority) based on reputation, seniority, and trust in the community. This effect generally works to improve content, but even a well-known, trusted user may make mistakes. If he or she is trusted well enough, there is a risk that an error or oversight may go unnoticed.
  3. Allow anybody and everybody to scrutinize and correct or flag the content. The process is not foolproof, especially in larger projects, but wikis have a remarkable capacity for self-cleaning. Of course, this approach can tend to result in a sort of groupthink effect: if enough people believe it, then it must be so.
  4. Demand credentials. Don't just let in any old riffraff. Wikipedia has clearly shown the power of amateurs and volunteers to create great content, but it is certainly possible to limit the users in a project, or part of the project, to a certain group. This approach is most appropriate to a wiki serving a closed community, such as a professional or academic group, especially one dedicated to a particularly narrow or specialized topic.
  5. Make the messes behind the scenes, and publish only the good stuff, with some review process. The German Wikipedia published a paper book containing selected articles. Online, there have been proposals for a "Stable Versions" system, where a mature article would be reviewed and locked, and any additional changes would go through a separate editing or discussion page.
  6. Demand references. There is a movement within Wikipedia to reference the articles and the claims made in them. In the context of a dictionary, references may be other dictionaries. Is the word recognized by RAE or OED (whom we trust to have done the requisite homework)? They may be other works about words. Or, they may be citations. Citations are quotations including the word in question. They show context and provide evidence that the word is or was in use. Of course, we must still question the validity of the evidence. Are the 400 Google hits because somebody prolific uses that nonsense word as a handle? Is a word more valid if it was used by a blogger or two, or by Thornton Wilder? Is an etymology known with reasonable certainty or is it apocryphal? Depending on the size and resources of the wiki, efforts to verify and reference articles may be systematic, or they may be requested when a given entry or fact is questioned.
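
The remark in the first approach, that much vandalism is predictable enough for a bot, can be sketched in a few lines of Python. The two patterns below are assumptions chosen to match the examples given there ("asdfasdf" and "Dave is a dork!"); they are not taken from any real bot, and the insult list is purely illustrative.

```python
import re

# Assumption: a run of six or more home-row keys is probably keyboard mashing.
HOME_ROW_MASH = re.compile(r"^[asdfghjkl;]{6,}$", re.IGNORECASE)

# Assumption: "<name> is a <insult>" with a short, illustrative insult list.
INSULT = re.compile(r"\b\w+ is an? (dork|idiot|loser)\b", re.IGNORECASE)


def looks_like_vandalism(text):
    """Return True when the text matches one of the simple patterns above."""
    stripped = text.strip()
    # Remove spaces so "asdf asdf" is treated like "asdfasdf".
    if HOME_ROW_MASH.match(stripped.replace(" ", "")):
        return True
    return bool(INSULT.search(stripped))


for sample in ("asdfasdf", "Dave is a dork!", "A word is a unit of language."):
    print(sample, "->", looks_like_vandalism(sample))
```

Even this toy shows the limits: the perfectly good word "alfalfa" consists of home-row letters only and gets flagged, so a bot like this should mark edits for a human to review rather than delete them outright. And it does nothing at all against plausible-looking misinformation.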
A wiki is simply a website where anybody can post. With a bit of care and attention, its content can be as valid and accurate as any other reference, and certainly more complete and up-to-date.

Friday, February 16, 2007

An article in Nature ...

I am absolutely thrilled with the article that was published last Wednesday in Nature... It is a great article and it explains really well what we hope to achieve with relational data in MediaWiki. The only thing that is a bit sad is that you have to pay $30 for the privilege of reading it.

The article is great, and what makes it special is the presentation that Knewco has created to explain what we hope to achieve; this demo is available at wikiprofessional.info. It presents some really impressive figures; it indicates the work done to integrate several important resources of the bio-medical domain and the numbers involved.

For me the most important point is that this is likely to be a very important stimulus to the Open Access movement. It indicates that it is possible to bring together what was divided. It allows people to work with the terminology of their field and also add data that is very specific, information that goes much further than what was envisioned in what was once called the "Ultimate Wiktionary".

The whole notion of a resource that because of its roots already merged lexicology, terminology and ontology is really special. With the integration of such specialised data from different domains like the bio-medical, another really interesting experiment will be under way when the data gets imported and merged. There is a nascent community for the bio-medical domain, and it will find that it will co-exist with the existing OmegaWiki community.

Both communities have everything to gain from collaboration; much of what the existing OmegaWiki community cares about will be seen as a fringe benefit. On the other hand, the translations that exist for concepts like malaria will prove to be of value when scientific articles are considered that were not published in English.

I am convinced that a bright future is ahead of us. We have this vision of what may come; I wish I could look into the future and see what it will be like. :)

Thanks,
GerardM

Sunday, February 11, 2007

Why compete when you can collaborate?

All words of all languages of the world.. that is what we eventually aim to include in OmegaWiki. This aim is of such a magnitude that you have to be certifiable to come up with such a project. The functional design for the project includes much more; everything including the kitchen sink..

When everything is to be included in one project, it is easy to suggest that people contribute to the project. When the project includes everything, why have another?

In an Open Source / Open Content environment this is not necessarily how it works. Why should the others be seen as competitors? They do their own thing, sure. You may want to achieve the same thing, also true. It is however very much possible to find the synergy between projects. This way you can build on each other's accomplishments.

The Shtooka project is something I learned about the other day. The one thing it does really well is making it easy to record pronunciations. You can record a string of words and it will save them for you one at a time.

Wiktionarians saw this and they are working on an upload facility so that the recordings will also be saved automatically to Commons. I warned that the files should not only be saved as .ogg files. In order to make sure they are relevant for scientists there should also be a .wav file. The current thinking is that the FLAC file format will work as well, and the benefit is that it provides lossless compression. To make sure that this is the case, the Praat software, which is also available under the GPL, was analysed and it was considered easy to incorporate the FLAC file format.

People from effectively five different communities are now working together. It will even be possible to include links to OmegaWiki in the Shtooka metadata. This will be possible even though both projects do their own thing. Both the data and the functionality can be shared.

I may be certifiable, but this kind of collaboration is awesome, and it is why there may be method to this madness.. :)

Thanks,
GerardM

Thursday, February 08, 2007

Become an OmegaWiki developer

OmegaWiki is now running the latest version of the MediaWiki software used by Wikipedia. This is a major milestone, as it also makes it a lot easier for anyone to join in the fun of developing the open source OmegaWiki/Wikidata software. To give credit where credit is due, these are the people who have contributed to the code so far:
  • Peter-Jan Roes
  • Karsten Uil
  • Sean Burke
  • Rod A. Smith (sticky tree expansion via cookies)
  • Ævar Arnfjörð Bjarmason (namespace code installer)
  • Charles Pritchard (Multilingual MediaWiki development, ongoing)
  • Jelte Zeilstra (untranslated meaning script, under review)
  • Zdenek Broz (statistical scripts, under review)
  • Paa-Kwesi Imbeah (Wikimedia Commons support, under review)
  • Marc Carmen (TBX export, incomplete)
  • myself
There are probably others I forgot. These people, some of them volunteers, some paid developers, are helping to build the first truly multilingual, massively collaborative ontology. If you want to become a part of this history, there are now instructions that should help you get on your way. Please contact me under erik AT openprogress DOT org once you have read and followed these instructions. There are always plenty of things that need doing. And as the organization which runs OmegaWiki, Stichting Open Progress, develops more and more partnerships around the project, we will look to our team of existing developers to help us implement them.

Wednesday, January 31, 2007

Greek languages

At OmegaWiki, we saw that Lou started to change the capitalisation of language names... A few days ago I was surprised that the Georgian names for languages were incorrectly spelled. Now it is Greek.

It is really powerful to see that once the language names have been corrected, they are available to everybody who wants to know about Greek. This reason for using OmegaWiki proves itself again.

Thanks,
GerardM

Sunday, January 28, 2007

Latin roots etc.

Well, yesterday one thing came to mind: a dictionary that a teacher of mine at the language school had. It was a dictionary that listed Latin words with many translations into other languages, and one thing is obvious: all these words were of course similar in all languages. If you knew one of them and studied the other language, it would have been easy to create the related words by following a set of rules for most of them.

So one thing should be obvious: to insert these words with their translations into OmegaWiki ... but well, there is one problem with Latin - the "normal" Latin language should not be mixed with the taxonomical Latin that is used in science ... so we need to create two languages: Latin and taxonomical Latin ... who knows if the relevant language codes exist somewhere in the ISO 639 standards.

Friday, January 26, 2007

OLPC needs a dictionary viewer

I had a word with the director of content for the OLPC, the One Laptop Per Child Project. As you know, OmegaWiki is the project that works on providing the OLPC with dictionary content. We are working on all these words, and while we are making steady progress, there is so much still left to do. We are getting more Expressions in many languages, but the definitions are lagging, and while we do our best, it is still very much the difference between there being nothing and there being next to nothing. It does however show that things are getting under way...

As the moment when kids are exposed to the systems is drawing closer, it is relevant that the data can be used. So we need a dictionary viewer. It needs to run on Linux, and it should have a small footprint. As we will provide all these languages, it will be interesting to see how the rich tapestry that OmegaWiki tries to weave will materialise on these nifty systems.

If you have a suggestion, please let us know :)

Thanks,
GerardM

Wednesday, January 24, 2007

Georgian names for languages

In the past I got permission to copy content from a resource with the names of languages. I am still grateful for the data. It got the Dutch Wiktionary going really nicely, as at the time we needed those names of languages for the user interface.

With OmegaWiki we had the same issue; we needed language names again for the user interface. This was to make it possible for people to see the labels of translations in their own language. From the moment the data became available we have learned a lot, for instance that languages like Danish and Italian do not capitalise the names of languages.

Today I was told that many of the names of languages in Georgian were found to be in error and had been corrected. The great news for OmegaWiki is that we only have to do this once and it is good everywhere. The sad thing is that it is probably wrong in many, many Wiktionaries. There were two types of errors; it was just wrong or it was the name of someone from a country instead of the name of the language.

The best I can do for the Wiktionaries is notify them in this way, as I do not really know what needs doing.

Thanks,
GerardM

Tuesday, January 23, 2007

Stichting Open Progress

Stichting Open Progress is the Dutch not-for-profit organisation that is the legal organisation behind OmegaWiki. As OmegaWiki is growing to the extent where we have to consider contracts for hosting, grants and the like, we had a need for an organisation.

The need for an organisation was also felt as we already had some projects where we would have been better able to do things had there been a legal entity backing up the activities. Some of these projects are quite substantial.

Open Progress aims to develop both Open Source/Free Software and Open Content/Free Content projects. As part of its mission it gives room for projects that are aligned with the aims of the stichting. Obviously OmegaWiki is the first; from an organisational point of view, the OmegaWiki commission decides on the issues that arise. Resolutions will be enacted for the project by the stichting provided they are in line with Dutch law and provided they do not circumvent the aims of the stichting. This way Open Progress hopes to make OmegaWiki a safe haven where people and organisations work in the understanding that the aims of the project will be respected.

There are two websites for Open Progress; in line with the experiences of the Wikimedia Foundation, we have both an internal and an external wiki. The internal wiki will use Semantic MediaWiki to leverage as much as possible the information that we will include. As the information will include both personal information and confidential project information, the internal wiki will be invite-only.

Thanks,
Gerard Meijssen
voorzitter Stichting Open Progress

Monday, January 15, 2007

Destinazione Italia

Destinazione Italia is a project of the University of Bamberg. It provides training for people learning Italian at an advanced level. Bamberg is a German university and many of its students are German. Many of the students, however, have a different mother tongue. Learning a third language based on the knowledge of a second language is less effective than learning based on the knowledge of the mother tongue.

I am really proud to announce that OmegaWiki has been selected by the University of Bamberg as the platform that will host the lexicological information for "Destinazione Italia". The initial phase of the project will create a lot of Italian-based DefinedMeanings. In the second phase we will translate these words to English, German and Spanish. The third phase is to find translations in as many other languages as we can get.

Research done by Zdenek Broz showed that when the same combination of quality translations in German, English and Spanish is found in a different resource, the translations into other languages that this resource shares can be included as well. According to Zdenek's figures this will get us an accuracy of around, and probably better than, 95%.
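
As I understand the idea, it can be sketched in Python as follows. This is my own reading and not Zdenek's actual method: the dictionaries keyed by ISO 639-3 codes, the function name `import_candidates` and the sample words are all assumptions made up for the illustration.

```python
def import_candidates(defined_meaning, external_entries, pivots=("deu", "eng", "spa")):
    """Collect translations from external entries that agree on every pivot language.

    defined_meaning: dict mapping a language code to our quality translation.
    external_entries: list of dicts in the same shape, taken from another resource.
    Both shapes are hypothetical; real data would need normalising first.
    """
    found = {}
    for entry in external_entries:
        # An entry qualifies only when it has all pivots and each one matches ours.
        agrees = all(
            entry.get(lang) is not None and entry.get(lang) == defined_meaning.get(lang)
            for lang in pivots
        )
        if not agrees:
            continue
        # Take over the languages we do not have yet.
        for lang, word in entry.items():
            if lang not in defined_meaning:
                found.setdefault(lang, set()).add(word)
    return found


house = {"ita": "casa", "deu": "Haus", "eng": "house", "spa": "casa"}
other_resource = [
    {"deu": "Haus", "eng": "house", "spa": "casa", "fra": "maison", "nld": "huis"},
    {"deu": "Haus", "eng": "home", "spa": "casa", "fra": "foyer"},  # English differs
]
print(import_candidates(house, other_resource))
```

The second entry is skipped because its English word does not match, which is the whole point: three agreeing pivot languages leave little room for a different meaning to slip through.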

There is a budget to get us many translations in other languages. The sweet thing is, when we are able to provide quality translations, the budget can be used for other things. This can be to improve the OmegaWiki usability, it can also be to spend money on a language that is not part of the initial list of languages "Destinazione Italia" supports.

The challenge is therefore: how much can we do with a limited budget? What will be the added value of creating content in a wiki environment? When will OmegaWiki reach the tipping point where collaboration in OmegaWiki is the obvious thing to do? "Destinazione Italia" will help us reach that point. :)

Thanks,
GerardM