Sunday, December 31, 2006

When a year ends, it is an opportune time to reflect and to look forward. The past year was pretty amazing; WiktionaryZ went from nothing but a proof of concept to OmegaWiki, with functionality that includes language-dependent part of speech annotation. I want to express my gratitude to all the people who made this possible. I particularly want to thank Knewco for their growing belief in, and understanding of, what Open Source, Open Content and Open Access mean. They have not only been instrumental in the development of OmegaWiki; my prediction is that their contribution will also help immensely in the thinking about how to make all this sustainable.

OmegaWiki is very much an experiment; it brings organisations and communities together. In the development of OmegaWiki we have seen how immensely valuable both are. It is because of this that I find it so disappointing that the Wikimedia Foundation finds it so difficult to entertain the possibilities that such cooperation brings.

As the WMF decided that they were not in a position to host OmegaWiki, we are now in a position with Open Progress, the Dutch not-for-profit organisation we had to set up, to do things differently where we think it makes a difference. This notion of "eating our own dog food" has always been an important part of the way OmegaWiki works; in the same way we hope to implement the ideas that we agree on in our community.

The most important notion of OmegaWiki is in its definition of success: "success is when people find an application for our data that we did not think of". The implication is that the data of OmegaWiki is there to be used and that collaboration is what OmegaWiki is about: collaboration in a way that recognises that the success of OmegaWiki is integral to the success of everyone who helps OmegaWiki to be a success.

Technically OmegaWiki is near the turning point where we have sufficient capability to host several ontologies together. This will coincide with the arrival of "collection relation types" and "domain relation types". With the arrival of language dependent attributes we are on the threshold of being able to include the information that differentiates one linguistic entity from another.

These two capabilities will be really important in 2007. They will bring a lot of relevant data to OmegaWiki and are likely to bring relevance to the project. I envision that we will indicate which Unicode characters make up the standard characters for a language. Among other things, this will show where more characters are needed; it will also help define what the proper sorting order is for a linguistic entity.
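To make this a bit more concrete, here is a minimal sketch in Python of what recording the standard characters of a language could make possible. Everything in it is an assumption for illustration only: the `SPANISH_ALPHABET` data, the function names and the idea of storing the characters in collation order are hypothetical and not how OmegaWiki stores anything. Spanish is used simply because "ñ" sorts after "n" there, while a plain code point sort puts it after "z".

```python
import unicodedata

# Hypothetical data: the standard characters of a language, listed in the
# order that language sorts them. This is an illustration, not OmegaWiki's schema.
SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"


def normalise(text):
    """Lower-case the text and compose accented letters into single characters."""
    return unicodedata.normalize("NFC", text).lower()


def missing_characters(text, alphabet):
    """Return the letters used in `text` that the declared alphabet does not cover."""
    known = set(normalise(alphabet))
    return sorted({ch for ch in normalise(text) if ch.isalpha() and ch not in known})


def sort_key(word, alphabet):
    """Build a sort key that follows the order of the declared alphabet."""
    rank = {ch: i for i, ch in enumerate(normalise(alphabet))}
    # Characters outside the alphabet are placed after all known letters.
    return [rank.get(ch, len(rank)) for ch in normalise(word)]


words = ["nube", "ñandú", "oso", "nido"]
ordered = sorted(words, key=lambda w: sort_key(w, SPANISH_ALPHABET))
uncovered = missing_characters("ñandú", SPANISH_ALPHABET)
```

In this toy example `ordered` places "ñandú" between "nube" and "oso", and `uncovered` reports "ú" as a character that the declared set still lacks. Real collation involves much more than this, so treat it purely as a sketch of the idea.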

In a mail to the Yahoo afrophonewikis group, Don Osborn asked for a "Year of Unicode in Africa"; I hope that OmegaWiki will help make this happen. In a mail to the Wiktionary mailing list, Javier Carro suggested collaborating on what he calls "Schemes". Both mails are from the last week, and I expect that implementing both will be feasible.

I am sure 2007 will be great .. Prosit Neujahr :)

Thanks,
GerardM

Saturday, December 30, 2006

Relevancy

Today it was in the news that Saddam Hussein was executed. He was no choir boy, and there was a trial. Many people are happy with his death; many people are unhappy with his death. I do not want to express my opinion; it is not relevant.

On occasions like this, it is important that the words that can be associated with such an event are understood and available in resources like OmegaWiki. The words that are in news items are the words that need to be explained. The figures as they exist for Wikipedia show that the articles that are most visited have to do with sex, sport and news. I expect that this is also true for a dictionary, but there are no statistics that I am aware of.

The word of the day for tomorrow could be gallows but I think that such a word on the last day of the year is a bit much.. Justice is much more appropriate.

Thanks,
GerardM

Tuesday, December 26, 2006

A flurry of activities

With the new part of speech functionality, OmegaWiki sees a lot of activity of another kind. For the first languages some of the parts of speech have been identified. The system is by necessity laborious; all the parts of speech have to be identified for all languages because we do not assume that a particular part of speech exists in a language.

Siebrand did a lot of work on identifying the parts of speech for the Dutch language. As a consequence we do not only have the "verb" but also the "copula". The idea is that when people know how to identify a verb as a verb, they can and may. When someone is able to identify it more precisely, they can. In the meantime, when we get functionality for inflecting verbs, it should work on both.
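As a rough illustration of this idea, here is a minimal Python sketch. It assumes that parts of speech are registered per language and that a more precise one, like "copula", points to the broader one it refines. The `PARTS_OF_SPEECH` table and both function names are hypothetical and are not taken from the OmegaWiki software; "nld" is the ISO 639-3 code for Dutch.

```python
# Hypothetical metadata: per language (ISO 639-3 code), the parts of speech
# that have been identified, each mapped to the broader one it refines.
PARTS_OF_SPEECH = {
    "nld": {"verb": None, "copula": "verb"},
}


def exists_in(language, part_of_speech):
    """A part of speech only exists in a language once someone has added it."""
    return part_of_speech in PARTS_OF_SPEECH.get(language, {})


def is_kind_of(language, part_of_speech, broader):
    """Follow the refinement chain upwards to see whether `broader` is reached."""
    known = PARTS_OF_SPEECH.get(language, {})
    current = part_of_speech
    while current is not None and current in known:
        if current == broader:
            return True
        current = known[current]
    return False
```

With this structure, functionality written for verbs could ask `is_kind_of("nld", "copula", "verb")` and so apply to both, while `exists_in("ceb", "verb")` stays false until someone has identified the parts of speech for Cebuano.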

Having functionality come on-line in small bytes is in line with the motto of Open Source/Free Software: publish often. It really helps. I can imagine the many refinements and expansions on what we have at the moment. It is relevant to realise that our software is still very much pre-alpha. It is not complete, but it demonstrates how the functionality is growing, making our dream a reality.

Thanks,
GerardM

Sunday, December 24, 2006

It is the night before Christmas

On the night before Christmas many people are full of anticipation of the presents that Christmas will bring. For OmegaWiki, we hope / expect that we will be able to have part of speech support. The software has been coded. The waiting is for the final touches and to see it enabled in OmegaWiki. Leftmost does a sterling job for us..

In order to get here, many hurdles had to be cleared. First there was a need to have default behaviour. This led to the sample sentence functionality. With the part of speech functionality, we can have a list of values, and we have functionality that is dependent on the language it applies to.

With the implementation of the functionality, there will be a need to identify what parts of speech exist in a language. This metadata needs bootstrapping, so we hope people will add the parts of speech to the languages they know well. We hope that they do this well because correcting metadata is problematic.

عيد الميلاد السعي Geseënde Kersfees Sretan Božić Καλά Χριστούγεννα

Thanks,
GerardM

Thursday, December 21, 2006

OmegaWiki now supports Cebuano

Cebuano is one of the languages spoken in the Philippines. Some 20 million people speak it as their first language and some 11 million speak it as a second language. It is good that Cebuano is now enabled for editing.

There is a Wikipedia in Cebuano, and as far as I can tell MediaWiki has been localised for it. This means that when the names of languages are translated, OmegaWiki will start to look attractive for the people who speak Cebuano.

Out of interest I checked if Open Office supports Cebuano. Googling shows that people use OO in Cebuano, but there is no official localisation for Cebuano and there isn't one for Tagalog either. Open Office does not support many languages and it will be hard work to get the user interface localised in more languages.

This does not mean, however, that Open Office cannot create content in Cebuano. What matters is that OO is able to indicate what language people use. It is not clear to me how to do this; it seems that OO only allows for the use of languages that it fully supports. This is in my opinion not the way to approach it.

If Open Office allows people to select the languages that they edit in, and those languages are all the written languages that ISO-639-3 covers, it should be possible for people to select the user interface that suits them best and even to use spell checkers that are created for these languages. With the correct tagging that is implicit in using the language tags, it will become easier to support the documents that are produced, because it is then possible to know explicitly what language a text is in.

The questions for me are:
  • Is my analysis correct .. please tell me it is not ..
  • How do we convince the OO people to support the use of all recognised languages that can be written?
  • How do we get support for spell checking in those languages as well?

Thanks,
GerardM

Tuesday, December 19, 2006

The use of Standards to emancipate languages

More and more languages are supported by the localisation efforts of projects like Open Office. This has a major effect on the emancipation of these languages; much content is written and some of it ends up on the Internet. When it gets there, its information is often lost because it cannot be found by a user who is not sophisticated in the use of search engines. Sophistication is needed because Google currently only recognises 100 languages and only 15% of the content of the World Wide Web is tagged to indicate a language and much of it is tagged incorrectly. When the quality of the tagging is improved, it will be possible for search engines to provide information that is only in the requested language. This will have the added benefit that a growing corpus of content will become available and this in turn will stimulate the research in these languages.

OmegaWiki is a wiki-based website that aims to provide information of a lexical, terminological and ontological nature. It does this by extending the MediaWiki software with relational functionality. OmegaWiki aims to have all words in all languages.

In order to learn what languages exist, OmegaWiki adopted the ISO-639-3 standard. This leaves out many linguistic entities like dialects and orthographies. OmegaWiki has had the good fortune to get into contact with the WLDC and GeoLang, the organisations that deal with the ISO-639-6 standard that is under development. Together with the WLDC, it has been proposed to the ISO task group to use the environment and the functionality that OmegaWiki provides to gather the data that is needed to learn about the different linguistic entities. This has been accepted by the task force at the LSGB conference in Vienna.

Much of the groundwork has now been done; the next step is to make this functional. To make it functional, we want software to adopt the existing standards and the information provided in OmegaWiki. This is feasible for Open/Free software. We have already approached the OmegaT lead developer, and he will be happy to support this because it will make OmegaT more relevant. Because of what OmegaWiki aims to do, we will be able to build spell-checkers for linguistic entities on a regular basis. This in turn will have an impact on the standardisation and the emancipation of languages.

With the emancipation of languages, it will increasingly become an option to bring information to people in their native languages. Studies have shown that this leads to a much better understanding and appreciation of the services provided. Particularly in what is called the long tail of the language industry, there has been little support for any of the standards. Consequently, the quality of service provided when translating for languages in the long tail is inconsistent. By developing the tools to include support for any linguistic entity, it will become possible to use these languages for written communications. It will also become possible to raise the quality of these communications by applying the work that has been done in the language industry.

To make this project take off, it works best when many people and organisations collaborate. All of them will have their own reasons to want to be included. The challenge will be to coordinate things in such a way that all the necessary parts are realised.

There are sufficient reasons for many organisations to buy into this project. What is needed is not only to leverage this but also to explore how the data that is gathered and the functionality that is built can be extended to provide additional information that is of relevance to the project. Many organisations will have a need for information; we will want their active participation, their support, or both. We will find all types of technical complications, and we need the buy-in of the people who can resolve these issues.

All in all, it will take a lot of effort to provide the difference that we aim for.

Thanks,
GerardM

Tuesday, December 12, 2006

WiktionaryZ becomes OmegaWiki

The OmegaWiki project aims to provide information about all words of all languages with a user interface in all languages. It achieves this by using relational database technology extending the wiki paradigm. This project attracted attention from organizations that made the development of relational database technology from inside the MediaWiki software possible.

OmegaWiki originated from the 171 Wiktionary projects of the Wikimedia Foundation. Originally the project was called "Ultimate Wiktionary"; it was renamed to "WiktionaryZ", and now it will be known as OmegaWiki to prevent confusion with its Wiktionary sister projects.

The OmegaWiki project will include specialist terminology and will have an ontological component that will allow for the inclusion of specialist thesauri. The experimental nature of the project and some of its requirements mean that the Wikimedia Foundation is not in a position to host the OmegaWiki project.

To allow for the further development and hosting of the OmegaWiki project, the Stichting Open Progress, a foundation established under Dutch law, will assume the responsibility for the project. The "WiktionaryZ commission" will continue to take care of the day-to-day issues for the OmegaWiki project, and it will be renamed to "OmegaWiki commission".

The content of OmegaWiki will continue to be licensed under a combined GFDL and CC-by license. This is to enable what the project has defined as success: "success is when someone finds an application for our data that we did not think of".

The Wikimedia Foundation and Stichting Open Progress will continue to work together to achieve their shared objective; to bring information to all people of this planet. For more information:
We hope and expect that the Wikimedia Foundation and Stichting Open Progress will be able to work together to achieve their shared objective; to bring information to all people of this planet.

For the Stichting Open Progress and the OmegaWiki committee,
Gerard Meijssen

First post

This is the first post for this blog. It was created as a consequence of the name change of the WiktionaryZ project. For what went before, please have a look at the WiktionaryZ blog.

Thanks,
GerardM