Integrating Wikipedia
Ka Kan Lo
The Semantic Web has attracted great interest from a wide variety of communities, including Wikipedia, intelligent systems and applications (knowledge systems), medical research, legal ontology, enterprise information management, and social networks. Before the Semantic Web, a number of institutions attempted to build such systems, with CYC as the notable example. Applications and a web with semantic capability are envisioned to greatly enhance the interaction between human users, information, and machines. So far, however, the "knowledge" itself is missing: the RDF and OWL of the Semantic Web, and E-R relationships. More precisely, we need a scalable, robust, and efficient infrastructure to feed RDF and E-R data into the target applications and information sources. This work converts information sources such as Wikipedia, and many others found online or offline, into a large-scale knowledge base in RDF and E-R formats that can be fed into other applications. The prototype runs in three languages: English, Chinese, and Arabic. We also demonstrate the potential of this knowledge base for semantic applications such as the discovery, retrieval, and extraction of information. Users can download the generated knowledge source (RDF, E-R) for use in their own applications. The application can be found at http://www.epistoc.com/ and is scheduled to launch during the Wikimania conference.
Wikipedia has long been regarded as a huge knowledge repository built through the collaborative and cooperative character of its communities. Until now, two main parties have been involved in using the Wikipedia projects: editors and users or, in more generic terms, writers and readers. Although reading and writing across the vast network of linked Wikipedia articles is currently the only way to access the knowledge inside, this does not rule out using more sophisticated techniques to make Wikipedia more open and easier to access. Previous talks by Wikipedians dealt with establishing a Semantic Web interface for Wikipedia, to make the knowledge inside more easily accessible to users and even to machines performing automated tasks. Our previous presentation on Wikipedia converted it into a huge network of knowledge repositories that served as the raw material for an automated question answering system, which achieved state-of-the-art performance.
This presentation is about integrating Wikipedia's network of knowledge infrastructure with the existing World Wide Web, to make the content universally accessible to a larger audience. It is organized in two major sections. The first covers the Semantic Web.
It has been argued that the Semantic Web will be the future of the Internet and of search. Although this is unproven, many communities have devoted tremendous effort to realizing that goal. The traditional Semantic Web has been written off as impossible; for criticism of the project, see the Semantic Web entry in Wikipedia. One of the major culprits hindering the growth of the Semantic Web has been the lack of annotations and of an agreed set of terms and lexicons for tagging the data. Tagging the knowledge network takes a huge amount of human labor: it has been estimated that converting human-readable text into a Semantic Web system would cost $10,000 per page, and that figure does not account for differing interpretations of the data.
By exploiting the inherent link structure of Wikipedia text, the natural context in which knowledge is used can be recovered. For example, entering "Egypt" into Wikipedia returns a vast number of articles relating to Egypt. After processing the text, we immediately know that "Egypt" is related to "Cairo" and "Alexandria" through the relation "city". Following the links to "Cairo" brings up its history, its population, and other statistics. Extrapolating along these links yields a natural grouping of all the background knowledge relating to Egypt. In other words, Wikipedia can be regarded as many segments of knowledge use joined together into one huge knowledge network.
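The link-following idea above can be illustrated with a minimal sketch. It is not the prototype's implementation: the sample link table, the relation labels other than "city", the direction of each relation, the example.org namespace, and the function names are all assumptions made for illustration. The sketch walks outward from "Egypt" and writes each link as an N-Triples line, one simple RDF serialization.

```python
from collections import deque
from urllib.parse import quote

# Hypothetical hand-written sample: article -> list of (relation, linked article).
# In the approach described in the text the relations would be induced from
# article text; here they are hard-coded, and "topic" is an invented label.
LINKS = {
    "Egypt": [("city", "Cairo"), ("city", "Alexandria")],
    "Cairo": [("topic", "History"), ("topic", "Population")],
}

BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def collect_triples(start, links, max_depth=1):
    """Breadth-first walk from `start`, returning (subject, relation, object) triples."""
    seen = {start}
    queue = deque([(start, 0)])
    triples = []
    while queue:
        article, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand articles beyond the depth limit
        for relation, target in links.get(article, []):
            triples.append((article, relation, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return triples


def to_ntriples(triples, base=BASE):
    """Serialize triples as N-Triples lines, percent-encoding each name."""
    return [
        f"<{base}{quote(s)}> <{base}{quote(p)}> <{base}{quote(o)}> ."
        for s, p, o in triples
    ]


triples = collect_triples("Egypt", LINKS, max_depth=1)
for line in to_ntriples(triples):
    print(line)
```

Raising max_depth to 2 also follows the links out of "Cairo", which corresponds to the grouping of background knowledge around Egypt described above.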
So what does this mean for the Semantic Web? Building on this link structure and content, our natural language processing techniques, which automatically learn the syntax and semantics of the language, annotate the data and convert it into Semantic Web format automatically. Compared with earlier Semantic Web proposals, which rely on subjective annotation of the articles' contents, our method is more objective because no third party is involved in adding metadata and tags. Instead, the metadata and tags are grown from the original text, which conveys the original intentions of the authors. This further improves the credibility of the system: tagging decisions are based on the collective intentions of the authors and can therefore be explained more easily. It also fits the collaborative principles of Wikipedia, where consensus rather than authority judges the validity of content and text.
The demo prototype is further used to connect to text outside Wikipedia. Because Wikipedia articles carry a large number of reference URLs pointing to the wider World Wide Web, the generated Semantic Web content can immediately be connected with outside content, which further enriches the knowledge base and broadens the scope of the Semantic Web content. This makes the content more practically usable for people outside Wikipedia.
The second part of the presentation is about information and knowledge access. We will demonstrate the usefulness of the induced knowledge to other applications. One particular application is the visualization of knowledge and information. The predominant modes of accessing information are the search and browse interfaces established in the early days of the web. Browsing involves arranging knowledge and information into a hierarchy through which users reach the content. Searching involves typing keywords into a search interface, then hoping and praying that relevant results come out. We have developed a prototype that integrates these modes. Users are free to type search keywords into the interface. Instead of showing only text snippets of the target content, the prototype automatically marks critical content and shows it in a separate interface alongside the snippets. Users can then judge the usefulness of the information earlier by examining the critical content, which often consists of real-world entities. Current search engines, which index terms blindly, cannot achieve this depth of judgment. With natural language processing, these terms are identified more easily and help users take a much faster path to valuable information and knowledge.
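As a rough illustration of showing marked entities alongside a snippet, the sketch below uses a plain dictionary lookup in place of the natural language processing described above. The entity list, the bracket markup, and the function name are assumptions for the example, not part of the prototype.

```python
import re

# Hypothetical entity list, standing in for entities induced by NLP.
KNOWN_ENTITIES = ["Egypt", "Cairo", "Alexandria"]


def mark_entities(snippet, entities):
    """Return the snippet with known entities bracketed, plus the entities found."""
    if not entities:
        return snippet, []
    # Try longer names first so a long name is not shadowed by a shorter prefix.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(e) for e in ordered) + r")\b")
    found = []

    def wrap(match):
        name = match.group(1)
        if name not in found:
            found.append(name)  # keep order of first appearance
        return f"[{name}]"

    return pattern.sub(wrap, snippet), found


marked, found = mark_entities(
    "Cairo is the capital of Egypt and its largest city.", KNOWN_ENTITIES
)
print(marked)  # snippet with entities marked inline
print(found)   # entities for the separate panel
```

The list returned next to the marked snippet plays the role of the separate interface: a user can scan the entities before deciding whether the result is worth opening.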
In summary, this presentation aims to contribute to the Wikipedia communities by demonstrating the potential of Wikipedia as a means of wider access to information and knowledge, not just for users but also for intelligent applications. In addition, the new way of visualizing information through natural language processing and learning techniques would help make information more easily accessible to its users and seekers. Through the demonstration and the talk, we hope to show the potential of using Wikipedia for further information tasks and to prompt the audience to think more deeply about the real potential of the Wikipedia project.
Some of the core techniques are rather technical, but we have decided to present the prototype and project to a much wider audience by stressing information access and the Semantic Web application, so as to increase the general audience's interest in the material and make it more accessible. If the presentation proposal is accepted, we will post the hyperlink for testing the prototype in a later release of public materials.

