Wikivoyage talk:Syntax checks

Syntax checks
Swept in from pub:

I have found many articles with syntax errors. Sometimes it is just "http" missing in a URL, sometimes it is more tricky. I have split the errors into different categories and uploaded them (WT-en) here so that the whole community can fix them (there are thousands). Thanks everyone, let's make Wikivoyage an even better travel guide! (WT-en) Nicolas1981 03:45, 17 June 2009 (EDT)


 * I've fixed many hundreds of the missing http errors over the past week. The absence of a name doesn't always seem to indicate an invalid listing, although 99 times out of 100 it is just totally blank XML.  Once we get through this initial batch of formatting old articles (I estimate I can finish the majority in a week or so) it should just be a matter of checking recent changes every week.  --(WT-en) inas 07:29, 17 June 2009 (EDT)
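
For anyone wanting to reproduce the "missing http" fix, here is a minimal sketch in Python. It is not the script used above. It assumes listings carry a `url="..."` attribute and that `http://` is the right scheme to prepend; both the attribute name and the function name `add_missing_scheme` are illustrative.

```python
import re

# Matches url="..." where the value is non-empty and has no http/https scheme.
# The attribute name "url" is an assumption about the listing markup.
URL_ATTR = re.compile(r'\burl="(?!https?://)([^"\s]+)"', re.IGNORECASE)

def add_missing_scheme(wikitext: str) -> str:
    """Prepend http:// to url attribute values that lack a scheme."""
    return URL_ATTR.sub(r'url="http://\1"', wikitext)

if __name__ == "__main__":
    sample = '<sleep name="Hotel Zork" url="www.example.com">'
    print(add_missing_scheme(sample))
```

Empty `url=""` attributes and values that already start with `http://` or `https://` are left untouched, so the function can be run repeatedly over the same text.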


 * Nice to meet you, fellow wiki checker :-) Your script has a very impressive feature list! Do you post the output somewhere, or do you fix all of the errors by yourself? I maintain the ASE checker on Wikipedia, and it belongs to a very convenient category called "Active Wiki Fixup Projects". Is there such a category on Wikivoyage? If not, how can such tools be publicized? My tool on Wikipedia is successful because many people have found it and spend many hours fixing the errors (I wouldn't have time to process such a huge backlog, thousands of articles requiring case-by-case analysis). My Wikivoyage checks have found about 10000 things to fix. Inas, where can we publicize our projects so that volunteers can come and fix what our scripts have found? Cheers (WT-en) Nicolas1981 01:40, 19 June 2009 (EDT)


 * Nice to meet you also. I think I have fixed over 250 of these errors today, and I think just about all of the malformed http URLs should be fixed, as these are a very simple form of error (but annoying to users, because the links break).
 * I have to confess that I'm not really a wiki checker though. Around a week or so ago, we were discussing adding to a blacklist a few phrases that touts always use when adding hotel listings, and I was amazed that with just about three phrases I managed to de-tout around 150 articles using a simple substitution pattern.  It seems hotel touts are anything but original in their expression.  Since then, I've got a bit obsessed, and went through another 100 articles, found errors in them, and then added them also as substitution patterns, with a bit of extra logic added here and there, to work around errors, and to detect when expressions such as "most luxurious" are being used in a listing, and when they are being used in the summary of the city.  I just happened to also add a few patterns to postlink, fix the http thing, a few currency things, a few phone number things, and remove external links that we know to be bad, and probably a few things I have forgotten now, and then for some stupid reason I decided to run it over all the articles, by hand.  Not sure why I decided to do that.  I have a prelim check that makes sure the article has something significant that needs fixing before I fix it, so I don't waste time on articles that don't.
 * Even simple things like postlinking are hard to let run automatically. Some people repeat the URL, some put "click here", both of which can be removed, but others use a noun, which has to be preserved.  I've got common phrases done, but I don't think I could ever get every exception.
 * As far as groups to publicise our projects go, this is it, I'm afraid. We could launch a wikichecking expedition, but I doubt we would get too many enthusiastic followers; most people are off doing stuff in various areas, and Wikivoyage is quite small.
 * The upside is that the amount of content here is much less than WP, and there are people here who spend time patrolling who I'm sure would be open to a bit of assistance in keeping a handle on errors and touts, so that things don't slip through the cracks, and the quality of the articles improves. There may be support for robots to accomplish simple tasks like fixing up url syntax, common and definitive spelling errors, and either lists of suspicious articles, or auto-reversions of suspicious edits.
 * Meanwhile, I'm happy to say that I think your list is now out of date. I'm happy to post my patterns and logic if you are interested.  If you are going to run your script again, could I suggest that you add a link to open the page for edit? It saves a click or two.   --(WT-en) inas 02:35, 19 June 2009 (EDT)
 * There are 40438 errors to fix. Minus the ones you corrected last week, it must be something like 39438 errors remaining. That's too many for one person but it can be achieved by a community. This is a wiki: everyone's contributions add up to form a great result. I am sure many people, when not traveling, would be happy to help fix articles about exotic places, if provided with clear instructions and a list. (WT-en) Nicolas1981 00:09, 21 June 2009 (EDT)
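
The idea mentioned above, telling a tout phrase inside a listing apart from the same phrase in the city summary, could be sketched roughly as follows. The actual patterns were never posted in this thread, so the phrase list, the set of listing tags, and the name `flag_touting` are all hypothetical. The sketch also only flags matches rather than substituting them, since a safe replacement depends on the sentence.

```python
import re

# Hypothetical phrase list; the real blacklist was not posted in this thread.
TOUT_PHRASES = ["most luxurious", "world class", "second to none"]

# Assumed listing tags; matches an opening tag through its matching close tag.
LISTING = re.compile(r"<(sleep|eat|drink|see|do|buy)\b[^>]*>.*?</\1>", re.DOTALL)

def flag_touting(wikitext):
    """Return (tag, phrase) pairs for tout phrases found inside listings only."""
    hits = []
    for match in LISTING.finditer(wikitext):
        body = match.group(0).lower()
        for phrase in TOUT_PHRASES:
            if phrase in body:
                hits.append((match.group(1), phrase))
    return hits

if __name__ == "__main__":
    text = ('Zork is the most luxurious city in the region.\n'
            '<sleep name="Inn">The most luxurious rooms in town.</sleep>')
    print(flag_touting(text))
```

Text outside listing tags is never inspected, so a phrase used in the city summary is not reported.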


 * By all means. Wikivoyage has Project:Expeditions, which you can start, solicit comment, and even store the output of scans within that namespace. I'm happy to help you, if you would like to do that.  Although you have found around 40,000 errors, there are probably only around 3000 or so articles in Wikivoyage that are ready for polishing.  Any others are more desperate for organisation and content than anything else.  For example, a missing ispartof/isin may just be due to an omission in a well-defined region, in which case adding it to an article takes a few seconds.  However, if the region is poorly defined then it is much more involved, and is more about the regioning expedition than it is about syntactic checking.  As you say, however, if there are people who wish to join in the effort, many hands..  --(WT-en) inas 00:21, 22 June 2009 (EDT)


 * I think I have covered all of the UK destinations, from Aberarth to Zennor, in the (WT-en) no-ispartof-nor-isin list. So that is about another 100 errors corrected! (WT-en) Tarr3n 12:01, 22 June 2009 (EDT)
 * Great, thanks a lot! It is good practice to remove entries you have processed. Cheers (WT-en) Nicolas1981 01:18, 26 June 2009 (EDT)
 * Oops sorry. Wasn't sure if it was appropriate to do that as I wasn't sure how the list was generated in the first place. I have now removed all those entries from the list, and found another couple I had missed the first time. (WT-en) Tarr3n 06:20, 23 July 2009 (EDT)


 * I've gone through the name missing list from A-D.

--(WT-en) inas 00:08, 23 June 2009 (EDT)
 * 1) A missing name is not an error.  A better test may be if there are no descriptions in the XML as well as no name.  This affects around 50% of the articles listed.
 * 2) Travel topics, itineraries, or phrasebooks do not and should not have ispartof information.  The check should report as errors those that do.
 * 3) Top level continents do not and should not have ispartof information.
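
The three points above could be expressed as check rules along these lines. This is only a sketch of the suggested behaviour, not the existing checker; the `Article` fields and the `kind` labels are hypothetical and would have to be derived from the article's templates or categories.

```python
from dataclasses import dataclass

# Article kinds that should not carry isPartOf, per points 2 and 3 above.
NO_ISPARTOF_KINDS = {"travel topic", "itinerary", "phrasebook", "continent"}

@dataclass
class Article:
    title: str
    kind: str           # hypothetical label, e.g. "city" or "itinerary"
    has_ispartof: bool

def ispartof_error(article):
    """Return an error string, or None if the article passes the check."""
    if article.kind in NO_ISPARTOF_KINDS:
        return "unexpected isPartOf" if article.has_ispartof else None
    return None if article.has_ispartof else "missing isPartOf"

def listing_is_blank(name, description):
    """Point 1: report a listing only when name and description are both empty."""
    return not name.strip() and not description.strip()
```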

Housekeeping
Swept in from pub:

There are a couple of tasks that could use more hands, neither vital but both useful. I try to do a couple of each daily, and others are clearly doing quite a few. However, it needs a few dozen more people to get them all done.

One is welcoming new users. Find them by looking at the Recent changes page, go to their talk page (not the user page, a mistake I've made too often), insert ~ and save the page.

Another is adding IsPartOf to articles that lack it, so that breadcrumb navigation works and you get a list like "Europe France South Marseille" at the top of the article. There is a list of articles without these at User:(WT-en) Nicolas1981/Syntax checks/no-ispartof-nor-isin (thanks, Nicolas!). If you fix an article, or find it is already fixed, please delete it from the list. (WT-en) Pashley 00:13, 7 December 2009 (EST)


 * Hi Sandy, I had a look at Nicolas's page and the list must be rather outdated. The Bangkok, Chicago and Paris articles on the list were all linked and had a working ispartof/isin link. I will have a look, but some articles have even been Star articles for quite some time (e.g. the Paris arrondissements have been Star for years). (WT-en) jan 09:38, 8 December 2009 (EST)


 * Well, AFAIK it's not trivial to generate a new list unless we get hold of Nicolas's script (and a new one would not be sorted, which took some effort). So I think just removing any entries already updated is the best solution. I've done a fair bit of work on this lately, and I'd say it's only about 1-in-10 that has been updated. --(WT-en) Stefan (sertmann) talk 14:48, 10 December 2009 (EST)


 * Until recently, no one thought to add isPartOf to districts, because breadcrumb navigation on districts doesn't use it. The isPartOf is only required for the RDF.
 * I have scripts which check for missing isPartOf, as well as many other syntax-type things, from time formats to spelling, incorrect section headers, and first person pronoun use in listings.
 * I'm happy to run them to update lists if anyone wants updated ones.--(WT-en) inas 15:06, 10 December 2009 (EST)


 * Do you have one that can exclude articles categorized as itineraries, travel topics, disambiguation pages and phrasebooks? I'm not keen on sorting out all those again. --(WT-en) Stefan (sertmann) talk 15:46, 10 December 2009 (EST)


 * It ignores redirects altogether. It regards isPartOf in disamb, travel topics, and itinerary articles as errors.  The real problem is these pseudo-regions, which are very hard to tell from a normal region article, and there is a good argument for not putting isPartOf on them, and I generally ignore them.  At the same time as doing this I'm trying to parse the regional hierarchy, so I'll be able to eventually tell a dead-end region from a real one, but there is heaps of RDF that could make this task easier.  It also hazards a guess at what the isPartOf should be, it understands subpages should be isPartOf parent pages, and it understands lead lines of the form "zorktown is a city/town in zork" --(WT-en) inas 17:06, 10 December 2009 (EST)
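
The two guessing heuristics described above (subpages belong to their parent page, and lead lines of the form "zorktown is a city/town in zork") might look roughly like this in Python. This is a sketch under assumptions, not the script itself: the regular expression, the list of place words, and the name `guess_ispartof` are illustrative.

```python
import re

# Assumed lead-line shape: "... is a city/town/village in <parent>".
LEAD_LINE = re.compile(r"\bis an? (?:city|town|village) in ([^.,;]+)", re.IGNORECASE)

def guess_ispartof(title, lead_line):
    """Guess the isPartOf target for an article, or return None."""
    # Subpages such as "Zork/Downtown" are taken to belong to the parent page.
    if "/" in title:
        return title.rsplit("/", 1)[0]
    match = LEAD_LINE.search(lead_line)
    if match:
        # Drop wiki link brackets and surrounding spaces from the captured name.
        return match.group(1).strip("[] ")
    return None

if __name__ == "__main__":
    print(guess_ispartof("Zorktown", "'''Zorktown''' is a city in [[Zork]]."))
```

A guess like this would still need a human check, since the lead line may name a country rather than the immediate parent region.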


 * What are pseudo-regions? An extra-hierarchical region? If so, no isPartOf should direct to it, but the article itself should indeed use isPartOf, as with any destination guide. --(WT-en) Peter Talk 18:39, 10 December 2009 (EST)


 * From what I understand, this is an example. --(WT-en) Stefan (sertmann) talk 19:39, 10 December 2009 (EST)


 * Ah, would it be appropriate to add disambiguation templates to such articles? --(WT-en) Peter Talk 20:51, 10 December 2009 (EST)


 * Perhaps a more customized version that makes it clear the page isn't trying to disambiguate between multiple places with the same name, but rather that we don't have a single article for the named place. (WT-en) LtPowers 21:49, 10 December 2009 (EST)


 * Is it allowed to remove the phrasebook articles and other ones that don't need IsPartOf? Would clean up the list as well. (WT-en) Globe-trotter 09:13, 14 December 2009 (EST)


 * There is good progress on Nicolas' list, thanks to a number of people. Once that is done, Inas's script should be run for an up-to-date list.
 * Another task that needs more hands is adding related tags to all the articles for places on the UNESCO World Heritage List; see Talk:UNESCO_World_Heritage_List. More generally, the Project:World_Heritage_Expedition is already doing good things but could use more participants. (WT-en) Pashley 20:32, 22 December 2009 (EST)


 * We have a UNESCO CotM in the works, which I think could finish that task. --(WT-en) Peter Talk 22:19, 22 December 2009 (EST)

Discussion..
Just thought I'd put a pointer to this discussion, for those interested in syntax checks. --(WT-en) inas 17:38, 6 June 2010 (EDT)


 * I think it would be great to start publishing these sorts of reports so that anyone who is interested can get involved with cleanups. It might help to have some sort of legend to explain what the various codes mean, but otherwise I'd be in favor of formalizing this as soon as possible from Project:Syntax checks if you're interested in doing so. -- (WT-en) Ryan • (talk) • 21:14, 6 June 2010 (EDT)


 * Okay, happy to do that. Please bear with me while I experiment with different layouts, to make them useful to others.
 * One thing, though. It would be nice to have some way of indicating that an article doesn't comply with normal style guidelines, so the script won't keep on reporting the same errors.  Examples being things like hotels being sorted in an order that is non-alphabetical, or using first person pronouns in listings, or having a second level heading that isn't a standard one.  I'll give some thought to that.. --(WT-en) inas 23:15, 6 June 2010 (EDT)


 * The format of the sample at User:(WT-en) Inas/Syntax sample now looks very clear, so I don't think a legend is really necessary. All of the items can be fixed automatically except the linkcount and tout items.   --(WT-en) inas 05:49, 7 June 2010 (EDT)