Difference between revisions of "User:Econterms/Report from WikiSym/OpenSym 2013"

From Wikimedia District of Columbia
Latest revision as of 01:46, 22 August 2013

WikiSym is an annual conference on academic research about wikis and other kinds of open collaboration. As in past years, some of the research is fascinating. This time I happily identified myself as a member of Wikimedia DC on my name badge. Here are some topics and findings I found interesting. Most of the full papers are linked from the conference proceedings, online here (http://opensym.org/wsos2013/program/proceedings).

Sources referred to by online encyclopedias

  • We saw an analysis of the sources cited in footnotes in English Wikipedia. Scholarly publications are cited less than in a traditional encyclopedia. Large fractions of references are to primary sources, and to sources from "alternative" publishers, governments, and nonprofits. They commented on global South geography. Heather Ford, David R. Musicant, Shilad Sen, Nathaniel Miller: the paper, online (http://opensym.org/wsos2013/proceedings/p0203-ford.pdf)
  • The giant Chinese online encyclopedia, Baidu Baike (BB; http://en.wikipedia.org/wiki/Baidu_Baike), is interestingly similar to the Chinese Wikipedia (CW), and there is a spectrum of differences, e.g. that submissions to Baidu Baike are reviewed by Baidu employees before they appear in it, and that different contents from the two sites have been blocked, censored, and removed at different times. Han-Teng Liao is writing a dissertation comparing the two. He showed tables of what sources they cited. BB seems to include a lot of text copied from Wikipedia. Both have a lot of copyright violations. Here is an abstract: http://opensym.org/wsos2013/proceedings/p0601-liao.pdf. Earlier findings from these comparisons: http://people.oii.ox.ac.uk/hanteng/2011/09/04/data-selection-and-analytical-strategies-comparing-the-content-of-baidu-baike-and-chinese-wikipedia/#more-287 and http://people.oii.ox.ac.uk/hanteng/2011/09/04/data-selection-and-analytical-strategies-comparing-the-content-of-baidu-baike-and-chinese-wikipedia-22/#more-295. Inspired by learning about BB, I asked students in our Wikimania-arranged dorm about the two and got interesting and different answers. This is a rich subject. Chinese Wikipedians often declare their political leanings openly on their user pages, from pro-communist to pro-Falun Gong to Han supremacist to Turkmen-independence favoring. BB has no frequent editors from Hong Kong or Taiwan, whereas CW has a lot. BB uses one simplified character set (I believe), whereas CW allows several character sets. He listed the web sites most cited by each; they include many "book review" sites and spam web sites. ("Book review" seems to mean a kind of spam site; this was not clear.) CW cites "bioinfo.cn" with very high frequency (that is a biology site, not a biographical one). BB seems to be much larger, based on numbers of citations.

Open access, data management, and governance

  • Think about what a version control system for datasets should be like. It's different from a source code version control system because, for example, data sets may be very large and may change in so many places from version to version that they are too hard to compare realistically. Sowe and Zettsu (http://opensym.org/wsos2013/proceedings/p0301-sowe.pdf) implemented a way of "curating" data sets with a wiki that points to the data. Here "data curation" means collecting, tending, organizing, validating, annotating and preserving data for reuse and sharing. They implement their "model" on a MediaWiki in which a description of the data ("metadata") is on the wiki and links to the data itself, and the individuals doing this have wiki histories and reputations. They've implemented this for their laboratory's disaster-response research, which can use diverse kinds of data sets: weather, industry, geospatial, satellite, population, media, and others.
  • Computational biologist Philip Bourne spoke in a plenary session on the challenges of open science. He described an experiment in making a PLOS publication that also went right to Wikipedia. He discussed how a scientific paper could or should be associated with easy access to its data and executable versions of its statistical analysis and graphs. This subject came up other times at the conference. It implies a set of steps beyond open data toward open and reusable data and analysis. We're not close to making this easy to implement; it's a bit like making a movie for each scientific paper, which also includes its footnotes. He is the co-founder and founding Editor-in-Chief of the open access journal PLOS Computational Biology which, he said, is publishing 30,000 articles this year and is by this measure the largest academic journal in the world. (Bourne's slides: http://www.slideshare.net/mobile/pebourne/wiki-symopensym2013)
  • Beat Estermann reported on a survey of GLAM institutions in the German part of Switzerland, and their roles in "open data". They got back questionnaires from 72 such institutions, with a lot of detail on their knowledge or action related to five open-data activities or technologies, e.g. whether they put scans or photos of their heritage objects online, and with what restrictions, and whether they are posting open data according to various standards. Respondents perceived the greatest risks or shortcomings of open data and crowdsourcing to be the extra time, effort and expense it would take, and were not concerned about losses of revenue because they earn little from sales of image rights or lending fees.
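
To make the dataset-curation idea in the first item above more concrete, here is a minimal Python sketch of a wiki-side metadata record that points to data stored elsewhere. It is not Sowe and Zettsu's implementation: the field names, the `DatasetRecord` class, and the use of a SHA-256 fingerprint to tell versions apart are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical wiki-side description of a dataset stored elsewhere."""
    title: str
    data_url: str   # where the data itself lives (not on the wiki)
    curator: str    # wiki user responsible for this version
    sha256: str     # fingerprint of the data file's contents
    notes: str = ""


def fingerprint(data: bytes) -> str:
    """Return a content fingerprint for a blob of data."""
    return hashlib.sha256(data).hexdigest()


def same_version(a: DatasetRecord, b: DatasetRecord) -> bool:
    """Two records describe the same data if their fingerprints match."""
    return a.sha256 == b.sha256
```

Comparing fingerprints only says whether two versions differ, not where; that fits the point above that large data sets can be too hard to compare line by line.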

Licensing -- free and Creative Commons

  • Australian law professor Anne Fitzgerald explained recently adopted licensing rules for the data and publications from the Australian government's statistics and geography agencies. After careful review, she and others on a committee recommended against adopting a public-domain rule (like the U.S. government's) and in favor of a Creative Commons noncommercial attribution copyright (CC BY-NC). If I understood correctly, this was desirable to help the government control commercial sale of its publications. Open access and copyrights on government work were actively discussed and debated at WikiSym and Wikimania too. For more, see here (http://www.oaic.gov.au/information-policy/information-policy-resources/information-policy-agency-resources/principles-on-open-public-sector-information).
  • Creative Commons licenses, such as those for Wikipedia content, don't apply the same way in all countries. These are licenses (giving permission) with "some rights reserved", and have these key attributes: BY means attribution, NC means noncommercial, SA means Share Alike, and ND means no-derivatives. It is not clear that in some countries (such as the UK, India, Hong Kong, and NZ) users can waive all the implied rights in a license clause electronically, without really "signing" somehow. There is ambiguity about when an adaptation is a derived work and when it is a new work in its own right: If work A is used in work B, then work B is adapted into work C but contains nothing left of work A, does C have to attribute anything to A? Does A's license still apply to C? If work A is licensed only non-commercially, can work C be sold?
And how much does any of this depend on the jurisdiction of the computer server, versus the jurisdiction of the actors? For these and other head-scratchers, see Poorna Mysoor's paper (http://opensym.org/wsos2013/proceedings/p0401-mysoor.pdf). Version 4, draft 3 of the CC licenses is under development, and I gather they are trying to internationalize them so the text looks and works the same across countries.

Undone edits ("reverts") on Wikipedia

  • A "revert" is a wiki edit that undoes a string of previous edits by somebody else. Most reverts are intended to maintain quality against vandalism and errors. Geiger and Halfaker (http://opensym.org/wsos2013/proceedings/p0200-geiger.pdf) analyzed the sources of reverts on Wikipedia. The authors show that ClueBotNG is the quickest and most active mechanism -- usually acting against vandalism within 20 seconds if it will act at all -- and discuss the spectrum of other bots, tools, and human behaviors that cause reverts. ClueBotNG was down several times for days in 2011, and they analyze how many reverts occurred in those periods. They conclude in essence that the same quality control was exercised in those periods, but more slowly, and they discuss how slowly.

There are a variety of software mechanisms by which a revert can happen: manual, cyborg, bot, and batch; these are distinct. In this paper, time-to-revert is measured for several tools and approaches. How long does it take for humans, cyborgs, and bots to revert edits? Here are some technology specifics:

  • ClueBotNG is very quick at detecting apparent vandalism. It reverts between 2,500 and 5,000 edits per day, has made 2 million edits, and makes 13.7% of all reverts. It makes its decisions within 5 seconds.
  • Huggle is a cyborg tool, meaning a human is making the decisions; it doesn't run in a browser but rather in its own interface. It takes longer than ClueBotNG.
  • STiki, developed by Andrew West, is a similar, more sophisticated cyborg tool, also slower than ClueBotNG. With either of these, simple buttons let one person revert one edit or many.
  • A bot called XLinkBot finds recent changes that cite, for example, Facebook. These edits are removed pretty systematically, although sometimes they are justified. XLinkBot takes a minute or two to make a revert.
  • DumbBOT removes templates that are one week old, so it has a peak then.

They created a histogram of time-to-revert during a normal month, January 2011. There is a huge spike at 1-20 seconds for bots and a big pile for in-browser human users. In between there is a pile that I guess is for cyborgs, though this was not quite clear. They also made empirical probability density graphs for these editing and reverting technologies.
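
A time-to-revert histogram of this kind can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the bucket edges, the labels, and the `(tool_class, edit_time, revert_time)` input format are assumptions chosen here, with times given in seconds.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket edges in seconds: 20 s, 1 min, 1 h, 1 day, 1 week.
EDGES = [20, 60, 3600, 86400, 604800]
LABELS = ["<20s", "20s-1m", "1m-1h", "1h-1d", "1d-1w", ">=1w"]


def bucket(seconds: float) -> str:
    """Map a time-to-revert in seconds to its bucket label."""
    return LABELS[bisect_right(EDGES, seconds)]


def time_to_revert_histogram(reverts):
    """Count reverts per (tool class, time bucket).

    `reverts` is an iterable of (tool_class, edit_time, revert_time)
    tuples, with both times in seconds on the same clock.
    """
    counts = Counter()
    for tool_class, edit_time, revert_time in reverts:
        counts[(tool_class, bucket(revert_time - edit_time))] += 1
    return counts
```

Grouping by tool class (for example "bot", "cyborg", "human") is what would make a bot spike in the first bucket stand apart from the slower human pile.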

In 2011 ClueBotNG went down for four distinct periods: Feb 15-18, Mar 13-17, Mar 29-Apr 7, and one other. The researchers explored how the temporal distribution of revert activity changed. Overall it seemed that the same functions were performed, although the pace of edit reverts declined by about half.

ClueBotNG is said to depend on the "new changes feed", so it could not recover from its own downtime by getting old edits from the queue; it just got restarted.

  • One research project showed that the likelihood that an edit will be reverted can be predicted pretty well by how well the categories the editor has edited match the categories of the article, even if one leaves out what the edit itself looks like. (Paper by Segall and Greenstadt: http://opensym.org/wsos2013/proceedings/p0205-segall.pdf). The paper says: "We also compare the IllEdit system to ClueBot NG, a leader in automatic Wikipedia vandalism detection, and discuss the utility of both algorithms working in parallel. It is important to note that the IllEdit algorithm is meant to detect reverted edits, not gross vandalism. Vandalism is a subclass of reverted edits that exhibits a willful misrepresentation of information or defacement. We present an algorithm that can be used to detect edits that are reverted based on accidental misinformation as well."
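
As a rough illustration of the "category match" idea, the overlap between an editor's past categories and an article's categories could be scored as below. This is a sketch under stated assumptions, not the IllEdit algorithm: Jaccard similarity is a simple stand-in chosen here, and the function name is hypothetical.

```python
def category_match(editor_categories, article_categories) -> float:
    """Jaccard overlap between two collections of category names.

    Returns a value from 0.0 (nothing in common) to 1.0 (identical sets).
    If either collection is empty, the score is 0.0.
    """
    a, b = set(editor_categories), set(article_categories)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A low score would suggest the editor is working outside their usual topics; in a real predictor it would be one feature among several rather than a decision on its own.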