User:Econterms/Report from WikiSym/OpenSym 2013


Draft blog post about WikiSym/OpenSym 2013

WikiSym is an annual conference on academic research about wikis and other kinds of open collaboration. As in past years, some of the research is fascinating. This time I happily identified myself as a member of Wikimedia DC on my name badge. Here are some topics and findings I found interesting. Most of the full papers are linked from the conference proceedings, online here.

Sources referred to by online encyclopedias
  • We saw an analysis of the sources cited in footnotes on English Wikipedia. Scholarly publications are cited less often than in a traditional encyclopedia. Large fractions of references are to primary sources, and to "alternative" publishers, governments, and nonprofits. The authors also commented on the geography of cited sources, including coverage of the global South. Heather Ford, David R. Musicant, Shilad Sen, Nathaniel Miller: (the paper, online)
  • The giant Chinese online encyclopedia, Baidu Baike, is interestingly similar to the Chinese Wikipedia, with a spectrum of differences: for example, submissions to Baidu Baike are reviewed by Baidu employees before they appear, and different content from the two sites has been blocked, censored, and removed at different times. Han-Teng Liao is writing a dissertation comparing the two. He showed tables of what sources each cites. Baidu Baike seems to include a lot of text copied from Wikipedia, and both have a lot of copyright violations. Here is an abstract: [1] Earlier findings from these comparisons: [2] and [3]. Inspired by learning about Baidu Baike, I asked students in our Wikimania-arranged dorm about the two sites and got interesting and different answers. This is a rich subject. Chinese Wikipedians often declare their political leanings openly on their user pages, from pro-communist to pro-Falun Gong to Han supremacist to favoring Turkmen independence. Baidu Baike has no frequent editors from Hong Kong or Taiwan, whereas the Chinese Wikipedia has many. Baidu Baike uses one simplified character set (I believe), whereas the Chinese Wikipedia allows several character sets. He listed the most-cited web sites on each; many are "book review" sites and spam web sites ("book review" seems to mean a kind of spam site; it was not clear). The Chinese Wikipedia cites "bioinfo.cn" with very high frequency (that's biology, not biographical). Baidu Baike seems to be much larger, judging by numbers of citations.
Open access, data management, and governance
  • Think about what a version control system for datasets should be like. It's different from a source code version control system because, for example, data sets may be very large and may change in so many places from version to version that they are too hard to compare realistically. Sowe and Zettsu implemented a way of "curating" data sets with a wiki that points to the data. Here "data curation" means collecting, tending, organizing, validating, annotating, and preserving data for reuse and sharing. They implemented their "model" on a MediaWiki in which a description of the data ("metadata") lives on the wiki and links to the data itself, and the individuals doing this have wiki histories and reputations (a rough sketch of such a record appears at the end of this section). They've implemented this for their laboratory's disaster-response research, which can use diverse kinds of data sets: weather, industry, geospatial, satellite, population, media, and others.
  • Computational biologist Philip Bourne spoke in a plenary session on the challenges of open science, including an experiment making a PLOS publication that also went straight to Wikipedia. He discussed how a scientific paper could or should be associated with easy access to its data and executable versions of its statistical analysis and graphs. This subject came up other times at the conference. It implies a set of steps beyond open data toward open and reusable data and analysis. We're not close to making this easy to implement; it's a bit like making a movie for each scientific paper, one which also includes its footnotes. He is the co-founder and founding Editor-in-Chief of the open access journal PLOS Computational Biology which, he said, is publishing 30,000 articles this year and is by this measure the largest academic journal in the world. (Bourne's slides)
  • Australian law professor Anne Fitzgerald explained recently adopted licensing rules for the data and publications from the Australian government's statistics and geography agencies. After careful review, she and others on a committee recommended against adopting a public-domain rule (like the U.S. government's) and in favor of a Creative Commons attribution-noncommercial license (CC BY-NC). If I understood correctly, this was desirable to help the government control commercial sale of its publications. Open access and copyrights on government work were actively discussed and debated at WikiSym and Wikimania too. For more, follow up here.
  • Beat Estermann reported on a survey of GLAM institutions in the German-speaking part of Switzerland and their roles in "open data". They got back questionnaires from 72 such institutions, with a lot of detail on their knowledge of or actions related to five open-data activities or technologies, e.g. whether they put scans/photos of their heritage objects online and with what restrictions, and whether they post open data according to various standards. Respondents perceived the greatest risks or shortcomings of open data and crowdsourcing to be the extra time, effort, and expense they would take; they were not concerned about losses of revenue, because they earn little from sales of image rights or lending fees.
  • Creative Commons licenses, such as those for Wikipedia content, don't apply the same way in all countries. These are licenses (giving permission) with "some rights reserved", and they have these key attributes: "BY" means attribution, "NC" means noncommercial, "SA" means share-alike, and "ND" means no-derivatives. It is not clear that users in some countries (such as the UK, India, Hong Kong, and NZ) can electronically waive all the rights implied in a license clause without really "signing" somehow. There is also ambiguity about when an adaptation creates a derived work versus a new work in its own right: if work A is used in work B, and work B is then adapted into work C which contains nothing left of work A, does C have to attribute anything to A? Does A's license still apply to C? If work A is licensed only non-commercially, can work C be sold?

How much does any of this depend on the jurisdiction of the computer server, versus the jurisdiction of the actors? For these and other head-scratchers, see Poorna Mysoor's paper. Version 4, draft 3 of the CC licenses is under development, and I gather they are trying to internationalize them so the text looks and works the same across countries.
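Returning to Sowe and Zettsu's curation model above: here is a minimal sketch, in Python, of what one wiki page's metadata record might look like. The class, its field names, the annotate method, and the example values are my own illustration of the idea (metadata on the wiki, a link out to the data, and a visible per-curator history), not code or terminology from their paper.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """One wiki page's worth of curation metadata for an external dataset.

    The wiki stores only the description and a pointer to the data, not
    the data itself, since datasets may be far too large to version here.
    """
    title: str         # human-readable name shown on the wiki page
    data_url: str      # link to the dataset, hosted outside the wiki
    category: str      # e.g. "weather", "geospatial", "population"
    annotations: list[str] = field(default_factory=list)
    # (curator, summary) pairs -- curators accumulate a visible history,
    # which is where their wiki reputation comes from
    revision_history: list[tuple[str, str]] = field(default_factory=list)

    def annotate(self, curator: str, note: str) -> None:
        """Record a validation/annotation note and credit the curator."""
        self.annotations.append(note)
        self.revision_history.append((curator, f"annotated: {note}"))

# Usage: describe a dataset, then curate it.
record = DatasetRecord(
    title="Typhoon rainfall observations, 2012",   # hypothetical example
    data_url="https://example.org/data/typhoon-2012.csv",
    category="weather",
)
record.annotate("curatorA", "validated station IDs against the master list")
```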

Undone edits ("reverts") on Wikipedia
  • A "revert" is a wiki edit on that undo a string of previous edits by somebody else. Most reverts are intended to maintain quality against vandalism and errors. Geiger and Halfaker analyzed the sources of reverts on Wikipedia. The authors show that ClueBotNG is the quickest and most active mechanism -- usually acting against vandalism within 20 seconds if it will act at all -- and discuss the spectrum of other bots and tools and human behaviors that cause reverts. ClueBotNG was down several times for days in 2011, and they analyze how many reverts occurred in those periods. They conclude in essence that the same quality control was exercised in those periods, but more slowly, and they discuss how slowly.

There are a variety of software mechanisms by which a revert can happen: manual, cyborg, bot, and batch; these are distinct. Huggle is a cyborg tool, meaning a human makes the decisions; it doesn't run in a browser but rather in its own interface. STiki, developed by Andrew West, is a similar, more sophisticated cyborg tool. In either tool, simple buttons enable reverting one edit, or many edits by one person. A bot called XLinkBot finds recent changes which cite, for example, Facebook, and removes those edits pretty systematically, although sometimes they are justified. ClueBotNG is very quick at detecting apparent vandalism; it reverts between 2,500 and 5,000 edits per day, has made two million edits, and makes 13.7% of all reverts. "Without knowing of such non-human actors at work, it may seem unfathomable that such coordination against vandals could even be possible." (Geiger and Ribes 2010)

They use time-to-revert to measure the effectiveness of quality control in Wikipedia, citing several earlier authors. Time-to-revert works like this: imagine editor 1 adds a new section at 8:01am, editor 2 blanks the page at 8:06, and editor 3 reverts back to the version by editor 1 at 8:11am. Then the time-to-revert is 5 minutes, measured from the damaging edit at 8:06 to the revert at 8:11. How long does it take for humans, cyborgs, and bots to revert edits? The authors created a histogram of time-to-revert during a normal month, January 2011: a huge spike at 1-20 seconds for bots, a big pile for in-browser human users, and a pile in between that I took to be the cyborgs, though this was not quite clear. They made empirical probability density graphs for these editing/reverting technologies. ClueBotNG makes its decisions within 5 seconds; XLinkBot takes a minute or two; "cyborgs" like Huggle usually take longer, and STiki is also a bit slower and uses sophisticated heuristics. DumbBOT removes templates that are one week old, so its time-to-revert distribution has a peak at one week.

In 2011 ClueBotNG went down for four distinct periods, e.g. Feb 15-18, Mar 13-17, Mar 29-Apr 7, and one other. The authors explore whether the temporal distribution of revert activity changed. Oddly, no: it seems the system continued with the same functions being performed, although the time for reversion doubled. (Note to self: show the graphs, or a summary of them, to the economic historians.) The paper says ClueBotNG depends on the "new changes feed", so it could not recover from its own downtime by getting old items from the queue; it just got restarted.
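To make the metric concrete, here is a minimal sketch (my own illustration, not code from the paper) that computes time-to-revert from a simplified revision log for one page, treating a revert as a revision whose content matches an earlier revision, as in the worked example above:

```python
from datetime import datetime

# Simplified revision log: (timestamp, editor, content hash).
# This mirrors the worked example: editor1 adds a section, editor2
# blanks the page, editor3 restores editor1's version.
revisions = [
    (datetime(2011, 1, 5, 8, 1),  "editor1", "hash_A"),
    (datetime(2011, 1, 5, 8, 6),  "editor2", "hash_B"),
    (datetime(2011, 1, 5, 8, 11), "editor3", "hash_A"),
]

def times_to_revert(revs):
    """Yield (reverted_editor, time_to_revert) for each identity revert.

    A revert is a revision whose content hash matches an earlier one;
    time-to-revert runs from the first undone edit to the reverting edit.
    A fuller version would also handle partial reverts and repeated states.
    """
    first_seen = {}  # content hash -> index of first revision with that content
    for i, (ts, editor, h) in enumerate(revs):
        if h in first_seen and first_seen[h] + 1 < i:
            # Revision i restores an earlier page state; measure from the
            # first edit that damaged that state.
            damaged_ts, damaged_editor, _ = revs[first_seen[h] + 1]
            yield damaged_editor, ts - damaged_ts
        else:
            first_seen.setdefault(h, i)

for editor, ttr in times_to_revert(revisions):
    print(f"edits by {editor} reverted after {ttr}")  # -> 0:05:00
```

Building the paper's histograms would then amount to collecting these time deltas across many pages and bucketing them by the mechanism that made the revert (bot, cyborg, or in-browser human).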