Benutzer Diskussion:Stefan Kühn/Check Wikipedia/Archiv/2009/Juni

Feature request: instant feedback

Could you have a link next to the results of each check which triggers that check to run again and updates the list of errors? It would only have to re-check the first 50 errors, as those are the ones shown. This would stop people checking errors which had already been fixed, or making changes which don't actually fix the errors.

Sorry, I don't understand. Do you mean a link that starts the script to check the next 50 articles? Please describe it in more detail. -- sk 17:49, 1. Jun. 2009 (CEST)
Sorry, let me give a clearer example. Consider the section http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Check_Wikipedia#Double_pipe_in_one_link which has a list of 50 articles out of 860. Now imagine there was a link / button at the start of this section, and when you clicked it the script would run the test for "Double pipe in one link" against those 50 results, and remove from the list any that had been fixed. The list would stay at 50 results, though, by being repopulated from the remaining 810 (which are assumed to be still broken).
Hello IP, OK, I understand this feature request. At the moment I cannot include this in my script. It is very complex and I am happy that it works. But I am thinking about the next generation of this script, and this feature would be a good one. But then I cannot include it in Wikipedia, because I have no idea how to do that. I can write a special page in Perl and include it there, but this needs time. A lot of time! :-) -- sk 09:44, 2. Jun. 2009 (CEST)
In the next generation of the script you could have the data stored in a database (if you don't already) and when someone asks for one type of error to be retested, the script can pull out the top 50 results for that error from a database table, retest each one and delete any which are now fixed. Then you can have a function in the script which takes the top 50 results for each error and generates the updated wiki page for it, replacing the old one. I don't actually know how your script works, but it is impressive. Maybe you could explain what is hard about this feature. If you can't use a database, then you could have a separate page for each error which stores the data in tab-delimited format, for example.
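The retest-and-repopulate idea described above could look roughly like the following. This is only a Python sketch, not the actual Perl script: the list `queue` stands in for the database table or tab-delimited page, and `still_broken` is a hypothetical callback that reruns one check against one article.

```python
from typing import Callable, List

PAGE_SIZE = 50  # the wiki page shows 50 articles per error


def refresh_error_list(
    queue: List[str],
    still_broken: Callable[[str], bool],
    page_size: int = PAGE_SIZE,
) -> List[str]:
    """Retest the visible entries, drop the fixed ones, return the new page.

    `queue` is modified in place, so fixed articles are removed for good.
    Entries beyond the visible page are not retested; they are assumed
    to be still broken, as in the request above.
    """
    visible = queue[:page_size]
    fixed = {title for title in visible if not still_broken(title)}
    # Keep the original order; only retested entries can be removed.
    queue[:] = [title for title in queue if title not in fixed]
    return queue[:page_size]
```

With 860 queued articles of which two visible ones have been fixed, the returned page still holds 50 titles, refilled from the remaining 810.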

Check 75

Please see [1]. In that case it would probably be better to use a simple bullet point. --Matthäus Wander 00:54, 2. Jun. 2009 (CEST)

I have removed the pointless indentation. The headings are enough to tell things apart. -- sk 09:40, 2. Jun. 2009 (CEST)

Check #71 - X at wrong position for ISBN

I'm trying to figure out if there are some false positives for this check on the Chinese Wikipedia. The description of the script says that it checks that X is at position 10 only. Some of the results for the Chinese Wikipedia are 13-digit ISBNs. Is the X supposed to be at position 10 or 13 for them? For example, zh:旋風管家 contains the ISBN 978-4-09-127272-X. --Vina 08:44, 3. Jun. 2009 (CEST)

An "X" is allowed only in an ISBN-10. If you find an "X" in an ISBN-13, it is an error. See ISBN. If the check digit of an ISBN-10 comes out as 10, you write "X". If the check digit of an ISBN-13 comes out as 10, you write "0". So an "X" in an ISBN-13 is wrong. Sometimes the publisher prints the wrong number on the book. In en/de/sv we have a template for these wrong ISBNs. See en:Template:Listed Invalid ISBN. -- sk 08:56, 3. Jun. 2009 (CEST)
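To make the rule concrete, here is a small Python sketch of the two standard check-digit calculations and of the "X only at position 10 of an ISBN-10" test. It is an illustration only and not taken from the script; all function names are made up for this example.

```python
def isbn10_check_char(first9: str) -> str:
    """Check character for the first nine digits of an ISBN-10."""
    # Weights run from 10 down to 2; the result is taken modulo 11.
    total = sum(int(d) * w for d, w in zip(first9, range(10, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def isbn13_check_char(first12: str) -> str:
    """Check digit for the first twelve digits of an ISBN-13."""
    # Weights alternate 1, 3, 1, 3, ...; the result is taken modulo 10,
    # so a computed 10 becomes 0 and an X can never occur.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def x_is_misplaced(isbn: str) -> bool:
    """True if the ISBN contains an X anywhere but the end of an ISBN-10."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if "X" not in chars:
        return False
    return not (len(chars) == 10 and chars.index("X") == 9)
```

Under these rules the example from zh:旋風管家, 978-4-09-127272-X, is reported because it has 13 characters and still ends in X.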

error ? in check #34

Why does this check not discover [2] and [3]? There was | class="float{{{1}}}" width="{{{width}}}" align="{{{1}}}" style="background-color:inherit;border-collapse:collapse;border-style:none;margin: .5em .75em;" some code from pl:Template:CytatD with {{{. Malarz pl 11:39, 5. Jun. 2009 (CEST)

sub error_034_template_programming_elements is checking @lines, which contains no table code. In those articles the template elements (parameters) were in a table. Malarz pl 11:50, 5. Jun. 2009 (CEST)
At the moment I exclude tables from the check. -- sk 12:06, 5. Jun. 2009 (CEST)
Why? Malarz pl 08:14, 6. Jun. 2009 (CEST)
When I wrote this script, I had many problems with tables. Some of them I have never solved, for example a table inside a table. Such tables are very tricky to check. So at first I took the quick and dirty way and excluded tables. :-) Maybe I will have to change this in the future, but that needs a little time. -- sk 21:30, 6. Jun. 2009 (CEST)

wrong <pre> exclusion

I found that your script probably checks code inside <pre style="height:20em; overflow-y:scroll">. It works fine when the <pre> tag is used without parameters, but not in this case. The problem occurs in pl:dmesg with check #56, and probably in a few more articles. Malarz pl 22:48, 3. Jun. 2009 (CEST)

I will check this. -- sk 12:16, 4. Jun. 2009 (CEST)
 Ok, I have changed the code. If you see it again, please tell me here. Thanks. -- sk 13:38, 7. Jun. 2009 (CEST)
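How the script was changed is not shown here. As a rough illustration in Python (the script itself is Perl), excluding <pre> blocks that carry attributes could be done with a pattern like this; `PRE_BLOCK` and `strip_pre_blocks` are names invented for the sketch.

```python
import re

# "<pre" must be followed by a word boundary, so "<pre style=...>" matches
# but an unrelated tag such as "<prename>" does not.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>.*?</pre\s*>", re.IGNORECASE | re.DOTALL)


def strip_pre_blocks(text: str) -> str:
    """Remove every <pre ...>...</pre> block before running the checks."""
    return PRE_BLOCK.sub("", text)
```

The non-greedy `.*?` stops at the first closing tag, so two separate <pre> blocks in one article are removed individually and the text between them is kept.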

DEFAULTSORT parameter starting with a white space

Hello Mr. Kühn,

Sometimes I find a DEFAULTSORT starting with white space:

{{DEFAULTSORT:             Doe, John}}

This is a mistake.

Keep up the good work.

Regards,

Cantons-de-l'Est 11:36, 4. Jun. 2009 (CEST)

Very interesting idea. I will try to add this. Thanks. -- sk 22:28, 4. Jun. 2009 (CEST)
 Ok, I have added the new error 88. -- sk 21:46, 7. Jun. 2009 (CEST)
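A check of this kind can be expressed as a single pattern. The following Python sketch is an assumption about how such a test might look, not the code behind error 88; it only knows the English magic word DEFAULTSORT, so localized names would need to be added.

```python
import re

# One or more whitespace characters directly after the colon,
# followed by the first real character of the sort key.
LEADING_SPACE_SORTKEY = re.compile(r"\{\{\s*DEFAULTSORT\s*:\s+[^\s}]")


def has_leading_space_sortkey(text: str) -> bool:
    """True if a DEFAULTSORT key in `text` starts with white space."""
    return bool(LEADING_SPACE_SORTKEY.search(text))
```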

check 016

Hello, I just found a problem with check 016 when I tried to delete the control character #x200B in the article cs:Czechowice-Dziedzice. It seems to me that this character is somehow connected to some IPA characters, like "͡" in this example, and when I try to delete the control character I also delete the IPA character. Is there any solution on my side, or is an exception in the script needed? --Reaperman (cs) 14:51, 4. Jun. 2009 (CEST)

I read about the same problem somewhere yesterday, but now I can't find it! I will check this at the weekend. I think I must exclude this character. -- sk 22:32, 4. Jun. 2009 (CEST)
Ah yes, here is where I saw it yesterday. -- sk 07:55, 5. Jun. 2009 (CEST)
 Ok, I have changed the code for error 16. -- sk 13:41, 7. Jun. 2009 (CEST)

<noinclude> and others

Maybe <noinclude> in article space is not error, but <noinclude></noinclude> or <noinclude>\n</noinclude> (with newline) is. The same with tags <includeonly> and <onlyinclude>. Examples: [4], [5] Malarz pl 11:34, 5. Jun. 2009 (CEST)

Interesting idea. I will try this. -- sk 12:06, 5. Jun. 2009 (CEST)
 Ok, new error 85. -- sk 13:55, 7. Jun. 2009 (CEST)
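The condition proposed above (a tag pair with nothing or only white space between the tags) fits into one regular expression. This Python sketch shows the idea; it is not the implementation of error 85, and the names are hypothetical.

```python
import re

# The backreference \1 makes sure the closing tag matches the opening one.
EMPTY_INCLUDE_TAG = re.compile(
    r"<(noinclude|includeonly|onlyinclude)>\s*</\1>", re.IGNORECASE
)


def has_empty_include_tag(text: str) -> bool:
    """True if `text` contains an include-control tag pair without content."""
    return bool(EMPTY_INCLUDE_TAG.search(text))
```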

Misformatted external links

I can't remember if I already asked for this one or not; if so, my apologies... it must have been archived. Anyway, I was wondering if the script could detect external links which have double brackets, rather than single brackets, around them? This causes display errors like [this]. It would also be great if it could search for external links which contain a pipe | symbol, since this is often a sign of someone trying to separate the link's target and its description in the same way as with an internal link. Thank you! Keep up the great work! -Drilnoth (Talk) 18:57, 6. Jun. 2009 (CEST)

 Ok, new error 86. I don't check for the pipe symbol, because I found a correct weblink with a pipe. Something like http://www.xyz.abc?test=asd|asasd&sdfsdf. --sk 14:08, 7. Jun. 2009 (CEST)
Good point; thanks! -Drilnoth (Talk) 17:56, 7. Jun. 2009 (CEST)
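For illustration, a double-bracket test along these lines could be written as follows. This is a Python sketch under the assumption that only http, https and ftp links matter; the pipe is deliberately not tested, for the reason given above. It is not the code of error 86.

```python
import re

# Two opening brackets directly followed by a URL scheme.
DOUBLE_BRACKET_EXTERNAL = re.compile(r"\[\[\s*(?:https?|ftp)://", re.IGNORECASE)


def has_double_bracket_external_link(text: str) -> bool:
    """True if an external link in `text` is wrapped in [[ ... ]]."""
    return bool(DOUBLE_BRACKET_EXTERNAL.search(text))
```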

Broken character entity references

(from en:Wikipedia talk:WikiProject Check Wikipedia)

Any chance you could run a script to find things like [6], [7]. I've been finding a lot of these lately where the semi-colon is missing. Obviously this would be listed as a higher-priority error. — CharlotteWebb 21:07, 6 June 2009 (UTC)

Thought I'd mention it here. -Drilnoth (Talk) 01:27, 7. Jun. 2009 (CEST)
 Ok, new error 87. -- sk 14:25, 7. Jun. 2009 (CEST)
Excellent; thank you. -Drilnoth (Talk) 17:56, 7. Jun. 2009 (CEST)
Hello. Could you remove external links from checking this error? It seems that it makes false positives. --Reaperman (cs) 11:22, 8. Jun. 2009 (CEST)
I will fix this today. -- sk 13:27, 8. Jun. 2009 (CEST)
 Ok, I have deactivated this error. -- sk 21:45, 8. Jun. 2009 (CEST)

Wrong description

English description for error 86 is wrong.

The script found a link with two brackets to external source like [[http://www.wikipedia.org Wikipedia]]. External links only need one bracket like [[http://www.wikipedia.org Wikipedia]].

The second example should have only one pair of brackets.

And desc for error 85 has extra ". at the end.

The script found a tag without content or a line break like <noinclude></noinclude>. This tag can be deleted.".

I don't like the hack used in desc for error 87 "<tt>&a<code></code>uml;</tt>". --fryed-peach 17:48, 8. Jun. 2009 (CEST)

Desc for error 91 has a word "bedin". I suppose it should be "beginning" instead. --fryed-peach 18:02, 8. Jun. 2009 (CEST)
You can use &amp;uml; to display the HTML entity instead of a tag hack. --Der Umherirrende 19:49, 8. Jun. 2009 (CEST)
 Ok, I have put in all of these changes. Thanks for the info. Sorry for my broken English. :-) -- sk 21:50, 8. Jun. 2009 (CEST)

HTML named entities without semicolon

The script finds links. See [8]: all are false positives. Matma Rex answer me on plwiki 15:20, 8. Jun. 2009 (CEST)

I have deactivated this section. I will make a better version. -- sk 08:52, 9. Jun. 2009 (CEST)

A slightly odd request

I have another request related to DEFAULTSORTs. The English Wikipedia's guidelines at en:WP:CAT#Using sort keys state: "Don't begin sort keys with lower case letters, unless you want to create a separate sublist (the ordering places lower case letters after all capital letters). To ensure that entries differing by letter case appear together, apply the convention that initial letters of words are capitalized in the sort key, but other letters are lower case. For example, use 'Dubois' in sort keys rather than 'DuBois'."

My bot has been approved to add and modify DEFAULTSORTs to ensure that they are in line with this guideline. To aid in finding articles which need a DEFAULTSORT because of this (or need a current DEFAULTSORT modified because it isn't in line with this), it would be much appreciated if a scan could be made which would check for article titles which:

A) Contain one or more words which start with lowercase letters, but have no DEFAULTSORT, or have a DEFAULTSORT which contains lowercase letters at the start of a word. For example, en:Role-playing game, en:2004 in film, which should have DEFAULTSORTs of "Role-Playing Game" and "2004 In Film".

B) Contain one or more words with capitalization in the middle of the word, but have no DEFAULTSORT, or have a DEFAULTSORT which contains capitalization in the middle of a word. For example, en:Lewis DuBois, en:SSX, which should have DEFAULTSORTs of "Dubois, Lewis" and "Ssx".

These DEFAULTSORTs aid in category organization because capital letters are listed before lowercase letters by default, which causes some odd sorting issues.

I know that this is a rather odd request and it certainly isn't going to be needed in every language, but it would be very much appreciated for the English Wikipedia. Thanks! -Drilnoth (Talk) 18:50, 6. Jun. 2009 (CEST)

Very interesting. I will build two new errors for you. Maybe tomorrow. -- sk 21:33, 6. Jun. 2009 (CEST)
Thank you! I know that these will be very long lists, but it will be much appreciated whenever you have a chance. (Oh, and the third parameter of the "Lifetime" template also functions as a DEFAULTSORT, if it's not too hard to code that into your script too.) -Drilnoth (Talk) 01:25, 7. Jun. 2009 (CEST)
 Ok, A and B are now a new error. Today I have no time for "Lifetime". -- sk 22:14, 7. Jun. 2009 (CEST)
Okay; thanks! -Drilnoth (Talk) 03:05, 8. Jun. 2009 (CEST)

Would it be possible for this error (and the other DEFAULTSORT-related ones) to exclude any pages with the text "#REDIRECT"... redirects almost never have categories, so having them listed is kind of pointless (they don't need DEFAULTSORTs). Alternatively, you could only list articles as having such errors if they contain "[[Category:" in them. -Drilnoth (Talk) 23:43, 8. Jun. 2009 (CEST)

In dewiki we have many categories and Persondata in redirect articles, so we need this there. But the alternative way is possible. I will try this. -- sk 08:51, 9. Jun. 2009 (CEST)
Okay; sounds good. It just seems pretty pointless to add DEFAULTSORTs to pages which don't have categories. :) -Drilnoth (Talk) 16:08, 9. Jun. 2009 (CEST)
 Ok, there must be a category in an article for error 91. -- sk 21:57, 9. Jun. 2009 (CEST)
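The capitalisation convention quoted in this thread, combined with the "only if the page has a category" condition, can be sketched as below. This is a simplified Python illustration, not the code of error 91: it treats every run of letters as a word (so apostrophes split words), does not reorder personal names such as "Dubois, Lewis", and ignores any DEFAULTSORT already present.

```python
import re

# A "word" is a run of letters; digits, hyphens and spaces separate words.
WORD = re.compile(r"[^\W\d_]+")


def normalised_sort_key(title: str) -> str:
    """Capitalise the first letter of each word and lowercase the rest."""
    return WORD.sub(lambda m: m.group(0).capitalize(), title)


def needs_sort_key_review(title: str, text: str) -> bool:
    """True if the page has a category and its title breaks the convention."""
    if "[[Category:" not in text:
        return False
    return normalised_sort_key(title) != title
```

For the titles mentioned in the request this yields "Role-Playing Game", "2004 In Film" and "Ssx".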

Thank you

Thank you for trying to figure out a way to reduce this CPU usage problem. I don't know if it's good or bad that enwiki has so many errors that we can just keep using this same list for days while you fix it. :) -Drilnoth (Talk) 04:21, 15. Jun. 2009 (CEST)

No, it is not a problem with en. It is a problem with the toolserver and with my script when it fetches the article text. At the moment I use the API with this request:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=Paris

This way I can only get one article per request. It would be better to do it like this:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=Paris%7CDresden%7CBerlin

I will try to put this into the script, but it is a bigger job. -- sk 21:39, 16. Jun. 2009 (CEST)
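Building the multi-title request shown above is mostly a matter of joining the titles with "|" and URL-encoding the result. A minimal Python sketch follows (the script itself is Perl); `build_content_query` is a name made up for this example, and no request is actually sent.

```python
from typing import List
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"


def build_content_query(titles: List[str]) -> str:
    """Return one API URL that asks for the text of several articles."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        # The API separates titles with "|", which is encoded as %7C.
        "titles": "|".join(titles),
    }
    return API_URL + "?" + urlencode(params)
```

Calling it with Paris, Dresden and Berlin reproduces the second URL quoted above.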

Just a thought, though I don't know exactly how the script works or how access rights on the toolserver are defined: would it be possible to request the full text of the set of articles from the toolserver copy in MySQL? -- User:Docu

Hello Docu, I don't use MySQL. I only use the API (see above) and perl. -- sk 21:40, 16. Jun. 2009 (CEST)
It might be a way to limit the resources being used. It would retrieve the full text of the articles to check all at once, directly from the database. -- User:Docu

Problem creating the HTML output after the adjustment for high resource usage

Hello Stefan! Since the workaround for the high CPU load, the text file is still created (e.g. http://toolserver.org/~sk/checkwiki/dewiki/dewiki_output_for_wikipedia.txt), but the HTML output files (http://toolserver.org/~sk/checkwiki/dewiki/dewiki_output_for_wikipedia.html) are not updated. This seems to be a problem in all languages.--Video2005 21:57, 15. Jun. 2009 (CEST)

Thanks for the tip, I will look into the cause right away. -- sk 21:21, 16. Jun. 2009 (CEST)
 Ok, it should run tomorrow. -- sk 21:26, 16. Jun. 2009 (CEST)
Many thanks! --Video2005 22:16, 16. Jun. 2009 (CEST)

Problems

Did you run a new version of the script? It probably shows wrong pairs (article name, error code) in some cases. I've checked some high-priority errors on pl.wiki, and the code indicated in your tables was not in the articles (and wasn't in previous versions of the articles either). Malarz pl 21:25, 21. Jun. 2009 (CEST)

You are not alone! I've asked him the same in German [9]-- Ben Ben 23:59, 21. Jun. 2009 (CEST)
Malarz pl, I've changed your English, maybe I shouldn't do that - people could think that's impolite. If so, please say it - I wouldn't do that anymore.-- Ben Ben 23:59, 21. Jun. 2009 (CEST)
Shit. I have been expecting something like this, but in my tests I found none of these problems. I will fix this tonight. See also this info about the new version of the script. -- sk 07:13, 22. Jun. 2009 (CEST)
I have stopped the cronjob for today. -- sk 07:15, 22. Jun. 2009 (CEST)
 Ok, I have fixed this problem. There were two problems with the API: 1.) Only 50 articles are allowed per request. 2.) The order of the articles in the response can differ from the order in the request. I hope it works tonight. I have started a new live scan. -- sk 22:55, 22. Jun. 2009 (CEST)
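Both API pitfalls named above can be handled by splitting the title list into batches of at most 50 and by matching the returned text to articles by title instead of by position. The Python sketch below illustrates this; it assumes the legacy JSON layout of the API, where page text sits under `query.pages.<id>.revisions[0]["*"]`, and it is not the fix that went into the script.

```python
from typing import Dict, Iterator, List

MAX_TITLES = 50  # limit per request mentioned above


def batches(titles: List[str], size: int = MAX_TITLES) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def text_by_title(response: dict) -> Dict[str, str]:
    """Map each returned title to its text, ignoring the response order."""
    pages = response.get("query", {}).get("pages", {})
    result = {}
    for page in pages.values():
        revisions = page.get("revisions")
        if revisions:  # missing pages have no revisions
            result[page["title"]] = revisions[0]["*"]
    return result
```

Looking the text up by title means an error can no longer be attached to the wrong article when the server returns the pages in a different order.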

Pre with undefined end

Info.xml is reported, although it contains no pre that is left unclosed. The info text was "<prename>Hansjoerg</prename> <surname>Petry</surname> <street>Gerressener". This text fragment is inside source tags, however, so it should not be detected there. Der Umherirrende 00:29, 12. Jun. 2009 (CEST)

 Ok, done. --sk 09:25, 1. Jul. 2009 (CEST)

Error 61 could be extended and more efficient

Hi, Stefan!

Your program does not recognize self-closing references followed by a punctuation character:

<ref name="foobar"/>?

Anyway, these 12 index() calls on the whole text

	$pos = index( $text, '</ref>.') if ($pos == -1);	
	$pos = index( $text, '</ref> .') if ($pos == -1);
	$pos = index( $text, '</ref>  .') if ($pos == -1);
	$pos = index( $text, '</ref>   .') if ($pos == -1);
	$pos = index( $text, '</ref>!') if ($pos == -1);
	$pos = index( $text, '</ref> !') if ($pos == -1);
	$pos = index( $text, '</ref>  !') if ($pos == -1);
	$pos = index( $text, '</ref>   !') if ($pos == -1);
	$pos = index( $text, '</ref>?') if ($pos == -1);
	$pos = index( $text, '</ref> ?') if ($pos == -1);
	$pos = index( $text, '</ref>  ?') if ($pos == -1);
	$pos = index( $text, '</ref>   ?') if ($pos == -1);

could be replaced with a single regular expression match like

	$text =~ m|</ref> {0,3}[?!.]|;

or even more

	$text =~ m|</ref> *[?!.,]|;

I guess it is a bit faster (especially if there is no hit) and definitely more scalable.

Cheers

Bitman --193.6.17.70 13:52, 28. Jun. 2009 (CEST)

Thanks for this info. I will try this, but I am not a Perl guru. I am only learning by doing. I will test this. -- sk 09:19, 1. Jul. 2009 (CEST)
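The regular expressions suggested above still only look at closing </ref> tags. A pattern that also covers the self-closing form from the start of this thread might look like the following Python sketch; it is one possible extension, written in Python rather than Perl, and has not been tested against the script.

```python
import re

# Either a closing </ref> or a self-closing <ref ... />, then optional
# spaces, then one of the punctuation characters from the suggestion above.
REF_THEN_PUNCT = re.compile(r"(?:</ref>|<ref\b[^>]*/>) *[?!.,]")


def has_punctuation_after_ref(text: str) -> bool:
    """True if a reference in `text` is directly followed by punctuation."""
    return bool(REF_THEN_PUNCT.search(text))
```

The `\b` after `<ref` keeps the pattern from matching `<references/>`.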

misused id or class

Hello, can you add a new feature: detecting articles containing <span id="foo"> or <span class="foo">, and the same for <div>? Sometimes classes like these are misused. I think there is no need to use them in article text, only perhaps in code or templates. JAn Dudík 08:20, 30. Jun. 2009 (CEST)

Ok, I will try this in the next few days. -- sk 09:20, 1. Jul. 2009 (CEST)

Categories

Can you sometimes run your script on categories too? Some errors are the same. JAn Dudík 08:20, 30. Jun. 2009 (CEST)

My script works with the dumps. It scans all pages, including categories. But for most errors I only check namespace 0 (articles) and namespace 6 (images), nothing more. Which errors should also be checked in namespace 14 (category)? -- sk 09:24, 1. Jul. 2009 (CEST)

Small bug and feature request

  • Is it possible to take Menschenlesbarkeit out of the ISBN-13 list?
  • And could you append three more rows below the table giving the totals for each priority? That way one could see how many errors fall into which range. Example:
nr. name script dewiki previous scan last scan trend change
Total -- -- low 1000 800 -200
Total -- -- middle 2000 2050 50
Total -- -- high 500 123 -377

--Goforgold 15:06, 25. Jun. 2009 (CEST)

I would like to point briefly to Menschenlesbarkeit again (it looks as if it was forgotten), and also to this and this. I suspect the script ran amok somewhere. Regards -- Goforgold 17:10, 6. Jul. 2009 (CEST)

Thanks for the note. I will still take care of Menschenlesbarkeit. The other cases are just a hiccup of the script and will surely be gone after the next run. It must have been something with the API or network problems. For lack of time I have not been able to do anything on the script itself in the last few days. -- sk 17:31, 6. Jul. 2009 (CEST)

Error #003 in Turkish Wikipedia

Good evening. Could you please modify the script so that it not only searches for <references /> but also {{reflist}}? The reason why this is required is that the output shows the valid pages (with reference tags) as if they do not have any reference tags. Thanks! ----Superyetkinileti 20:07, 31 May 2009 (UTC)

Hello Superyetkin, searching for "reflist" is already a feature of the script. Can you tell me an article where the script doesn't find the reflist template? I think there is another problem. -- sk 17:48, 1. Jun. 2009 (CEST)
Hello there, sorry for the late reply.
The problem is that this template, which is used in featured articles, is not recognized by the script, and these articles come up with "missing <references />" errors on the project page. You can see the current situation here. Could you please examine the issue and resolve it? Thanks for your help. --Superyetkin 21:10, 9. Jul. 2009 (CEST)

 Ok, I have added the "Şablon:Kayan kaynakça". -- sk 21:55, 3. Aug. 2009 (CEST)
Thanks! --Superyetkin 10:03, 4. Aug. 2009 (CEST)

Wrong degree sign

Hello Stefan, couldn't you already create a list for WP:BA #Nummernzeichen statt Gradzeichen? That would be easier than possibly going through the whole database when one is not using a dump. -- @xqt 08:55, 12. Jun. 2009 (CEST)

I have put it on my to-do list. -- sk 21:59, 3. Aug. 2009 (CEST)
I would rather leave this out, because I don't know whether it is handled the same way in other languages. Apparently it worked out with the bot. -- sk 21:07, 18. Aug. 2009 (CEST)