More

emw · on May 21, 2024

Wikipedia indeed seems the most valuable for ML, by far. Wikidata, Wikimedia Commons, and Wiktionary also seem useful there.

resolutebat · on May 22, 2024

Wikivoyage is underrated and that was not helped by the acrimonious split with Wikitravel (which was acquired a predatory marketing company), but it finally seems to be pulling ahead.

hehdhdjehehegwv · on May 22, 2024

One of my favorite LLM applications is getting them to write wikidata queries. The data is amazing, but the query language is nothing but pure hell.

emw · on May 21, 2024

> I wouldn't be surprised if some Wikipedia editors balk at their volunteer work being actively marketed and reformatted for ease of LLM training

As someone who avidly edited Wikipedia for 6-8 years, I am happy to see my volunteer work used for LLM training. I also agree some other editors likely aren't.

lambdaone · on May 22, 2024

Given that all Wikipedia editors have explicitly consented to their content being released under the Creative Commons Attribution-ShareAlike 4.0 License, they don't get a choice about their content being used for any purpose.

Redistribution of content is an entirely different matter, and the legal status of copyrighted material in relation to LLM training is an open issue that is currently the subject of litigation.

emw · on May 22, 2024

Wikimedia Foundation’s perspective on this [1]:

> "it is important to note that Creative Commons licenses allow for free reproduction and reuse, so AI programs like ChatGPT might copy text from a Wikipedia article or an image from Wikimedia Commons. However, it is not clear yet whether massively copying content from these sources may result in a violation of the Creative Commons license if attribution is not granted. Overall, it is more likely than not if current precedent holds that training systems on copyrighted data will be covered by fair use in the United States, but there is significant uncertainty at time of writing."

The new Wikimedia Enterprise APIs facilitate attribution. For example, the "api.enterprise.wikimedia.com/v2/structured-contents/{name}" response [2] includes an "editor" object in a "version" object. So the Wikipedia editor who most recently edited the article seems quite feasible to attribute. ML apps could incorporate such attribution in their offering, and help satisfy the "BY" clause in the underlying CC-BY-SA 4.0 license for Wikipedia content.

---

1. https://meta.wikimedia.org/wiki/Wikilegal/Copyright_Analysis...

2. https://enterprise.wikimedia.com/docs/on-demand/#article-str...

ZunarJ5 · on May 22, 2024

As another editor, I think they might be a vocal minority. :)

emw · on Dec 13, 2019

> if you want to have really structured and semi-reliable information you will probably have to rely, at some point, on something like Wikipedia meta-information (DBpedia).

Wikidata is also worth considering for that task. It is:

* Directly linked from Wikipedia [1]

* The data source for many infoboxes [2]

* Seeded with data from Wikipedia

* More active and integrated in community

* Larger in total number of concepts

Wikidata also has initiatives in lexicographic data [3] and images [4, 5].

On the subject of Cyc: the CycL "generalization" (#$genls) predicate inspired Wikidata's "subclass of" property [6], which now links together Wikidata's tree of knowledge.

---

1. See "Wikidata" link at left in all articles, e.g. https://en.wikipedia.org/wiki/Knowledge_base

2. https://en.wikipedia.org/wiki/Category:Infobox_templates_usi...

3. https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/...

4. https://www.wikidata.org/wiki/Wikidata:Wikimedia_Commons/Dev...

5. See "Structured data" tab in image details on Wikimedia Commons, e.g. https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonar...

6. https://www.wikidata.org/wiki/Property_talk:P279#Archived_cr...

emw · on Feb 7, 2016

Gage's work is probably not prohibitively expensive in terms of money. A few plane tickets to Iowa and New Hampshire every four years, lodging in each for maybe a few days. He also attends Comic-Con every year. He has a good DSLR and lens kit, but that's a one time cost. If he had a paid summer internship in accounting, a side job during the academic year, and supportive parents this all seems doable for a middle class college student attending a public university with in-state tuition.

The bigger cost in Gage's photography is almost certainly time. He's uploaded almost 40,000 photos to Flickr since December 2007 -- roughly 5,700 photos a year. Even with the one-person-many-angles style seen in his Flickr stream at https://www.flickr.com/photos/gageskidmore/with/24744613302/, that's a lot. He also captions his photos, e.g. https://www.flickr.com/photos/gageskidmore/24744613302/.

The vast reach of his photography is likely a sufficient incentive for Gage to invest all that time and significant-but-not-immense money. I hope Gage continues his excellent work.

x43b · on Feb 7, 2016

>>Gage's work is probably not prohibitively expensive in terms of money.

"between states to more than 40 speaking engagements." "I traveled to nearly every part of the country to cover his political events,” "Skidmore is hot on the campaign trail again, toggling his time between New Hampshire, Iowa, and Arizona"

Prohibitively expensive is a relative term, but it is hard to imagine this not costing 10s of thousands of dollars and being out of the reach of most high school and college students.

pyre · on Feb 7, 2016

Maybe not in this particular case, but you could definitely make a case that photographing political candidates as that pass within range of a day-trip (or maybe a two-day-trip) of where you live is just the cost of time, gas, (food,) and lodging.

Someone could do something similar using such a model.

emw · on Feb 8, 2016

> * it is hard to imagine this not costing 10s of thousands of dollars and being out of the reach of most high school and college students.*

I would be surprised if Gage and his parents have spent more than $10,000 of their personal money on this hobby. Again, cost near that range is certainly significant, but not monetarily immense for a middle class kid with a consuming hobby and supportive parents over the course of 7+ years.

Gage has been frugal in his choice of college, and gets funding from GoFundMe campaigns. He also seems to have had side jobs. Simply choosing to attend a community college and then an in-state public university as Gage has done -- rather than a private university for 4 years -- is probably enough to defray a huge portion of his hobby's cost.

I suspect Gage is also frugal in his means of travel and lodging. A sibling comment mentions the possibility of photographing candidates that come within a day or two trip of home. I imagine that accounts for most of Gage's photography.

Consider this note from [2]: "Skidmore is a 19-year-old student at Glendale Community College in Phoenix and a freelance graphic designer. A Ron Paul supporter, he began photographing politicians when he was living in Terre Haute, Indiana, attending events held by Rand Paul during his successful 2010 Senate run in Kentucky." The drive from Terre Haute, IN to Lexington, KY is about 4 hours. That's completely doable in a day trip. I've driven 4 hours each way in day trips for similar free culture pursuits. It costs about $80 for gas and food.

> "between states to more than 40 speaking engagements."

Travel among multiple US states to attend 40 speaking engagements over the course of 7 years is not necessarily a major financial burden, even for someone Gage's age.

> "I traveled to nearly every part of the country to cover his political events"

"Part" can be pretty general. One could have covered events in Arizona, Iowa, New Hampshire and, say, Virginia and say one has traveled to nearly every part of the country -- the American West, Midwest, Northeast and South.

> "Skidmore is hot on the campaign trail again, toggling his time between New Hampshire, Iowa, and Arizona"

I think it's much more likely that Gage has been to New Hampshire and Iowa each once or twice in the 2016 campaign season, rather than flying out every weekend or so like a high-level political operative or corporate executive from his Arizona State University dorm room.

---

[1] http://priceonomics.com/how-a-college-student-used-creative-...

[2] http://www.niemanlab.org/2012/09/how-a-19-year-old-student-b...

wavefunction · on Feb 7, 2016

Expensive to whom? The average college kid? Don't kid yourself.

emw · on Jan 15, 2016

Wikipedians are hosting free events across the world for "Wikipedia Day" this weekend.

* San Francisco (Saturday): https://en.wikipedia.org/wiki/Wikipedia:Meetup/San_Francisco...

* New York City (Saturday): https://en.wikipedia.org/wiki/Wikipedia:Meetup/NYC/Wikipedia...

* Boston (Saturday): https://meta.wikimedia.org/wiki/Wikipedia_15/Events/Boston

* Bangalore (Sunday): https://meta.wikimedia.org/wiki/Wikipedia_15/Events/Bangalor...

* London (Sunday): https://meta.wikimedia.org/wiki/Meetup/London/101

* Portland, Seattle, Vancouver (Saturday, meet Ward Cunnigham!): https://meta.wikimedia.org/wiki/Wikipedia_15/Events/West_Coa...

New York will feature a talk about Wikidata, how to query it with SPARQL, and how we are integrating it with Wikipedia and pushing forward the Semantic Web. Other NYC talks include things like "Git-flow approach to collaborative editing", "Copyright and plot summaries", and "Automated prevention of spam, vandalism and abuse". We will be linking up with San Francisco and likely some other cities for a global teleconference at 4:00 - 5:00 PM ET (21:00 UTC).

If you're interested, sign up and stop by!

emw · on Oct 25, 2015

The Wikidata taxonomy is basically the successor to Wikipedia's category tree. It not only irons out language-based differences (e.g. the category tree being different among Chinese, Spanish, English, etc. Wikipedias), but also captures the idea of generalization through a more semantically meaningful relation. This Wikidata concept tree is constructed with "subclass of" (P279) [1], a property that expresses the proposition "all instances of these items are also instances of those items". The goal is to have a subsumption hierarchy that classifies all human knowledge.

There's an RDF/OWL export of the Wikidata taxonomy available at [2] as wikidata-taxonomy.nt.gz, which can be explored with Semantic Web browsers like Protege [3].

Another fundamental relation -- "part of" (P361) [4] -- expresses mereological relationships. For (oversimplified) example: "iris part of eye", "eye part of head", "head part of body", etc. Both "subclass of" and "part of" are transitive.

A separate comment of mine in this discussion [5] describes how to traverse the "subclass of" tree in the Wikidata UI and a third-party tool called Wikidata Generic Tree. The same principle applies to the "part of" tree. The latter gets less attention, but is also quite interesting.

---

1. https://www.wikidata.org/wiki/Property:P279

2. http://tools.wmflabs.org/wikidata-exports/rdf/index.php?cont...

3. http://protege.stanford.edu/

4. https://www.wikidata.org/wiki/Property:P361

5. https://news.ycombinator.com/item?id=10448573

igravious · on Oct 25, 2015

Superb response. Deeply informative.

I am very excited about the potential knowledge engineering possibilities opened up by this large structured datasets.

I believe that at the very least we're going to have within a generation a machine-generated ontology to rival Kant and Aristotle. Then we'll have to figure out if this tells us more about how we've digitally organized the knowledge we have or whether it does in fact reveal something about reality and being.

Besides 'subclass of' and 'part of' are there any other taxonomic ways for concepts to relate to other concepts? There are parallels here of course with object-oriented-programming. It's funny, I only within the last year or so started reading up on mereology[0] but as soon as one starts thinking about concepts and there relationships one ends up there eventually. 'part of' is like encapsulation. 'subclass of' is like inheritance. Is there more?

[0] (from the Greek μερος, ‘part’) http://plato.stanford.edu/entries/mereology/

emw · on Oct 25, 2015

Yes, there's also 'instance of' (P31) [1]. Together, 'instance of', 'subclass of' and 'part of' comprise Wikidata's basic membership properties [2].

'Instance of' and 'subclass of' provide Wikidata with a way to express the basic philosophical notion of type-token distinction [3]. For things that are a subclass of something like 'material entity', all instances are physical objects that have a unique location in space and time.

Not all instances are spatiotemporal particulars, though. For example, one might say "Homo sapiens instance of taxon", where taxon is a metaclass, i.e. a class in which the instances are classes. (Here 'taxon' would not be a subclass of 'material entity' -- i.e. taxa are information artifacts, not physical objects.) Support for this kind of "punning" via metamodeling is a major feature of OWL 2 DL [4].

If this sort of thing interests you, definitely take a look into Wikidata [5]. The project will be a sea change for several key features in Wikipedia (e.g. infoboxes), and will likely be a main hub of the Semantic Web.

---

1. https://www.wikidata.org/wiki/Property:P31

2. https://www.wikidata.org/wiki/Help:Basic_membership_properti...

3. http://plato.stanford.edu/entries/types-tokens/

4. http://www.w3.org/TR/owl2-primer/

5. https://www.wikidata.org

igravious · on Oct 26, 2015

Fantastic, I've read through your entire comment history :)

I'm familiar with OWL and RDF. I've been using Sparql and DBPedia, I'll switch to Sparql and Wikidata if you think that's the way to go. How do you see the overlap between DBPedia and Wikidata?

I'm concerned that there's going to be knowledge-grab by corporations and (perhaps) government entities. I fear that the knowledge graphs inside the big G and FB and Yandex and Apple and MS and so on to power their search engines and personal assistants will be orders of magnitude more sophisticated and complex and comprehensive that what will be available to open access research. Witness Freebase. Are my fears misplaced would you say?

I've read that SEP article, I've also read a good bit of Peirce's original journal article. As it says in SEP "It should be mentioned that for Peirce there is actually a trichotomy among types, tokens and tones,[...]" - I think it's amusing that basically everybody ignores the triadic distinction that Peirce claimed to be the case for a dualistic type/term distinction.

I'm looking forward to going through your tutorial quill in hand and pot of ink at the ready.

emw · on Oct 26, 2015

I think SPARQL and Wikidata are the way to go.

Regarding Wikidata and DBpedia: to my understanding the latter gets much of its content by scraping Wikipedia infoboxes. Wikidata will increasingly provide data for those infoboxes, and thus DBpedia.

Regarding your fears: I don't share them. Wikidata will greatly enhance the accessibility of knowledge for open access research.

emw · on Oct 25, 2015

It's also fun doing this on Wikidata with "subclass of".

* Go to a Wikidata item, e.g. "sailboat" [1]

* Click on the "subclass of" (P279) value or, if no such value exists, the "instance of" (P31) value

* Follow the "subclass of" chain up to "entity" [2]

You can do the reverse -- i.e. go down the classification tree -- with tools like Wikidata Generic Tree [3].

---

1. https://www.wikidata.org/wiki/Q1075310

2. https://www.wikidata.org/wiki/Q35120

3. http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp...

emw · on Oct 18, 2015

Wikidata's new SPARQL service is probably the most useful topic in this tutorial for software developers and anyone interested in the Semantic Web. It allows one to query the vast, free knowledgebase that backs Wikipedia -- almost 15 million entities and over 70 million statements.

Example queries:

* Politicians who died of cancer (of any type): https://query.wikidata.org/#PREFIX%20wikibase%3A%20%3Chttp%3...

* Who discovered the most planets? https://query.wikidata.org/#PREFIX%20wikibase%3A%20%3Chttp%3...

* Largest cities with a female mayor: http://query.wikidata.org/#PREFIX%20wikibase%3A%20%3Chttp%3A...

More Wikidata SPARQL query examples: https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Quer....

emw · on Oct 18, 2015

Author here, ask me anything! Slides are also available at http://www.slideshare.net/_Emw/an-ambitious-wikidata-tutoria....

amirouche · on Oct 18, 2015

Are they any open experiments to integrate wikidata with AI systems of any kind?

emw · on Oct 18, 2015

Yes. Kian and WikiBrain are two such projects. Kian is an artificial neural network designed to serve Wikidata, e.g. for classifying humans based on content in Wikipedia [1, 2]. WikiBrain uses Wikidata to recognize the type of relationships or connections between Wikipedia concepts [3, 4].

I suspect larger applications of Wikidata in AI will follow. For example, as of 2010, IBM Watson acquired at least some of its content from DBpedia and YAGO [5], which ultimately derive much of their content from scraping Wikipedia's infoboxes and category system. Now, come 2015, Wikidata is supplying data for some Wikidata infoboxes, and the proportion of infoboxes that pull from structured data in Wikidata will increase over time. And I expect Wikipedia's category system will gradually be supplanted by Wikidata's more expressive property system over time.

Thus, I imagine Wikidata will form a semantic backbone for Q&A systems like Watson in the future.

The Wikidata development team's work is funded through donations by the Allen Institute of Artificial Intelligence, Google, the Gordon and Betty Moore Foundation, and Yandex [6]. So organizations with an interest in AI see potential in Wikidata.

1. https://github.com/Ladsgroup/Kian

2. http://ultimategerardm.blogspot.com/2015/09/wikidata-ten-que...

3. https://github.com/shilad/wikibrain

4. http://conservancy.umn.edu/bitstream/handle/11299/163269/Und...

5. http://www.aaai.org/Magazine/Watson/watson.php

6. http://cacm.acm.org/magazines/2014/10/178785-wikidata/fullte...

Tpt · on Oct 19, 2015

There is also Platypus that is a small query answering engine based on Wikidata: http://askplatyp.us

troymc · on Oct 18, 2015

Is Wikidata only for notable data or any data? If only notable data is allowed, then how is that enforced? Who decides if some piece of data is notable?

Can data be permanently, unrecoverably deleted, or is it more like Wikipedia where you can usually go back to see text that was deleted (in older versions of an article)?

The way that units are being handled is troubling. Is the plan to assign a unique integer to every unit that's ever been used? That's a long list of integers.

emw · on Oct 18, 2015

> Is Wikidata only for notable data or any data?

Wikidata is only for notable data, but the notability threshold is much lower than that for Wikipedia. The criteria for notability are described at [1]. For example, we might add items for all known pathogenic genetic variants, but likely would not have an item for the fire hydrant on your street.

For things not notable enough for Wikidata, interested users could install a local instance of Wikibase [2], the software that runs Wikidata. Wikidata editors and administrators determine what is notable, and have places like [3] to discuss questionable cases.

> Can data be permanently, unrecoverably deleted, or is it more like Wikipedia

Wikidata works like Wikipedia in that regard. Previous versions of a given item or property are almost always viewable (and recoverable) through the History tab [e.g. 4]. In extraordinary cases, like a vandal posting sensitive information about a person, data can be hidden from normal view and/or actually deleted.

> Is the plan to assign a unique integer to every unit that's ever been used?

Basically yes, to my understanding. How many units do you think exist? We already have items for many units, e.g. meter, micrometer, nanometer, foot, yard, bit, byte, , gigabyte, etc. I can see how this implementation might seem naive; e.g. perhaps we could represent one standard metric (like meter or byte) and handle factor conversions of scale (kilometer, millimeter, etc.) through some mechanism we don't have in place right now. Consider also asking about this on the "Contact the Development Team" page [5].

1. https://www.wikidata.org/wiki/Wikidata:Notability

2. http://wikiba.se/

3. https://www.wikidata.org/wiki/Wikidata:Requests_for_deletion...

4. https://www.wikidata.org/w/index.php?title=Q1&action=history

5. https://www.wikidata.org/wiki/Wikidata:Contact_the_developme...

emw · on Sept 24, 2015

It's just 1 order of magnitude, if we're comparing the same language. English Wikipedia has 19,339 articles on philosophy [1]. (Anyone know if there's a resource comparable to SEP in a language other than English?)

If we're talking all philosophy articles in all 291 Wikipedias [2], and (generously) assume that the average Wikipedia has 10% as many philosophy-related articles as English Wikipedia, that's 19,339 * 0.10 * 291 = 562,764 philosophy articles on all Wikipedias. That's 3 orders of magnitude more than SEP's 1,500 articles -- although that's not really comparing apples to apples as we're comparing many languages to one.

1. See Total x Total cell at lower right in matrix at https://en.wikipedia.org/wiki/Category:Philosophy_articles_b...

2. https://en.wikipedia.org/wiki/List_of_Wikipedias