Gospel Translations:Technology/Multi-wiki project/basic bot.php

  1. <?php
  2. /*
  3. 		**********************
  4. 		BasicBot	v1.22	c2007
  5. 		**********************
  6.  
  7.  
  8. Adam's bot template. Use this to create your own PHP-based bots for MediaWiki sites.
  9.  
  10. Before beginning, you should DEFINITELY read this, even if your bot will be on a wiki other than Wikipedia:
  11. 	* http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot
  12.  
  13. Reading this will help you understand some of the code in the class below:
  14. 	* http://www.mediawiki.org/wiki/Manual:Parameters_to_index.php
  15.  
  16. You will also need the following:
  17. 	* PHP 5. If you only have PHP 4, check my tutorial for a small patch that should make this work (from what folks tell me, anyway).
  18. 	* Snoopy. Snoopy is a PHP class that makes it MUCH easier to submit forms (like MediaWiki edit forms, for example) via PHP.
  19. 		You can download Snoopy here:		http://snoopy.sourceforge.net/
  20. 		You should probably take a quick look at Snoopy's readme before trying to understand my code. My code simply extends the snoopy class.
  21.  
  22. What you'll find below is a template for building your own PHP-based bots to do automated tasks on sites powered by MediaWiki. I don't care what tasks you have in mind,
  23. this template should help you out. I've put quite a bit of commenting into this file to make your life easier. However, you should probably read my tutorial first,
  24. which gives a nice clear explanation of what this file does and how to use it. In fact, if you read the tutorial, you probably won't even need to look at this file in order to use it, at least
  25. for relatively basic bots.
  26. 	* My tutorial:	http://wikisum.com/w/User:Adam/Creating_MediaWiki_bots_in_PHP
  27.  
  28. Recommended usage: Don't edit this file. Just include it into another file, then extend the class or define new callbacks as needed. That way, you can use
  29. this file for as many bots as you want without making the file unwieldy and huge. This file should have come with a companion file, ChangeCategory.php, that
  30. should give you an idea of what I mean.
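
A minimal sketch of the callback half of that pattern. The callback name exampleCategoryCallback() and the 'from'/'to' keys in $params are made up for illustration; they are not taken from ChangeCategory.php. wikiFilter() hands the page's wikitext and the $callbackParams array to the callback through call_user_func() and saves whatever string comes back, so a callback can be tried on a literal string before it is pointed at a wiki:

```php
<?php
// Hypothetical callback: renames one category link in raw wikitext.
// $params['from'] and $params['to'] are illustrative names, not part of BasicBot.
function exampleCategoryCallback($content, $params = array()) {
	if (!isset($params['from'], $params['to']))
		return false; // wikiFilter() dies when a callback returns false
	return str_replace(
		'[[Category:' . $params['from'] . ']]',
		'[[Category:' . $params['to'] . ']]',
		$content
	);
}

// Dry run on a literal string, invoked the same way wikiFilter() invokes callbacks.
$sample = "Some text.\n[[Category:Old name]]";
$result = call_user_func('exampleCategoryCallback', $sample, array('from' => 'Old name', 'to' => 'New name'));
echo $result, "\n";
```

In a real bot file you would include this template first and then call something like $bot->wikiFilter('Some_title', 'exampleCategoryCallback', 'Renaming category', $params), where the title, summary, and $params values are placeholders.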
  31.  
  32. IMPORTANT: Be sure to check the settings below before getting started.
  33.  
  34. TROUBLESHOOTING
  35. Please note that I developed this code for my own use. I haven't tested it on any wiki other than wikisummary.com, and I haven't tested it in any environment
  36. other than the one I run my wiki on (PHP 5, Linux). I share this code in hopes that it will be helpful to you--when I wrote this, I was unable to find anything comparable
  37. out there. But I'm sure you'll come across bugs as you try to use this, since I really haven't tested it for anything other than what I use it for. You are especially
  38. likely to have problems with my link harvesting functions, since some of them require CSS classes or IDs that my wiki's custom template uses but your wiki probably doesn't use.
  39.  You're mainly on your own when it comes to solving bugs--I don't really have the time. But here are a few suggestions that might help:
  40. 	* Start by double checking all the settings below.
  41. 	* Make sure your cache and temp directories exist on the server and are writable.
  42. 	* If you're getting an error message, search this file for that message to see what caused it.
  43. 	* For help writing callback functions, look at the very end of this file. See also the companion file, which is a demo of a fully functional bot.
  44. 	* My code assumes that your wiki requires logging in to edit, not to read. If you must log in to read, search this file for "read" and uncomment the relevant parts.
  45. If you do find and fix a bug, please send me a patch so that others can benefit. Thanks.
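
For the directory suggestion above, a quick standalone check can rule out permission problems before you debug anything else. This is a sketch only: the helper name checkBotDirectory() is made up for illustration, and the two paths simply mirror the default CACHE and TEMP settings further down in this file.

```php
<?php
// Hypothetical helper: reports why a directory would be unusable for cache/temp files.
function checkBotDirectory($path) {
	if (!is_dir($path))
		return 'missing';
	if (!is_writable($path))
		return 'not writable';
	return 'ok';
}

// Assumed locations; adjust to match your own CACHE and TEMP settings.
$dirs = array(dirname(__FILE__) . '/cache/', dirname(__FILE__) . '/temp/');
foreach ($dirs as $dir)
	echo $dir, ': ', checkBotDirectory($dir), "\n";
```

Run it from the same directory as this file, as the same user your web server runs as; otherwise the answer may not match what the bot sees.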
  46.  
  47. HOW TO SAY THANKS:
  48. 	* I appreciate links to one or more of these:
  49. 		My wiki: 		http://wikisum.com (could really use some inbound links...)
  50. 		My site: 		http://adambrown.info
  51. 		This script:		http://wikisum.com/w/User:Adam/Creating_MediaWiki_bots_in_PHP
  52. 	* If you fix a bug, send me a note. You'll find my contact info at http://adambrown.info/p/about
  53. */
  54.  
  55. ##########################################################
  56. ##########################################################
  57. //		SETTINGS
  58. ##########################################################
  59. ##########################################################
  60.  
  61. // we'll detect our absolute path. You can override this if you want.
  62. $abspath = dirname(__FILE__);
  63.  
  64. // adjust as necessary
  65. define('SITECHARSET','UTF-8');
  66. define('SERVER','http://gospeltranslations.org');
  67. define('PREFIX','/w'); // no trailing slash. The prefix you use for index.php?title= links (e.g. editing links). Set to '' if you use no prefix other than what's in SERVER.
  68. define('ALTPREFIX','/wiki'); // no trailing slash. The prefix on valid links that visitors usually see. Might be the same as PREFIX if you don't use "pretty" links.
  69. define('CACHE', $abspath.'/cache/'); // a path where we can store cache files to. SHOULD EXIST and be writeable by the server. Stored for longer than files in TEMP.
  70. define('TEMP',$abspath.'/temp/'); // a path where we can store temp files to. SHOULD EXIST and be writeable by the server. Can be the same as CACHE if you want.
  71. define('COOKIETIME', 3600); // how many seconds should we hold our login cookies before refreshing them? Defaults to 3600 seconds = 1 hour.
  72. define('DELAY', 1); // default number of seconds that bots wait between requests to the wiki. Check your wiki's policies. Set to at least 30 if you aren't sure.
  73.  
  74. // if you want, you can put your default userid, username, and password into a separate file. I do it just so I don't accidentally upload a copy of this with my username and password inside :)
  75. if (file_exists('username.php'))
  76. 	require_once('username.php');
  77. // ELSE you need to fill out the next few settings.
  78. if (!defined('USERID')){	define('USERID','3');} // find it at Special:Preferences 
  79. if (!defined('USERNAME')){	define('USERNAME','Transbot');}
  80. if (!defined('PASSWORD')){	define('PASSWORD','gospel');} // password in plain text. No md5 or anything.
  81.  
  82.  
  83. ##########################################################
  84. ##########################################################
  85. // DONE WITH SETTINGS.
  86. ##########################################################
  87. ##########################################################
  88.  
  89. require_once('Snoopy.class.php'); // you can change this if snoopy is someplace else, obviously
  90.  
  91. if (is_array($_GET)){
  92. 	foreach( $_GET as $GETkey=>$GETval )
  93. 		if (!is_array($GETval)){$_GET[$GETkey] = stripslashes($GETval);}
  94. }
  95.  
  96. global $passv;
  97.  
  98. ##############################################################
  99. ############## you probably don't need to edit this ##########
  100. ############## look at the end of this file for more help ####
  101. ##############################################################
  102. class BasicBot extends Snoopy{
  103. 	var $wikiUserID = USERID;
  104. 	var $wikiUser = USERNAME;
  105. 	var $wikiPass = PASSWORD;
  106. 	var $wikiServer = SERVER;
  107. 	var $wikiCookies; // will hold the file name where our cookies are stored.
  108. 	var $wikiConnected = false;
  109. 	var $wikiTitle; // holds the title that wikiFilter has just called from. Makes it easy to know where we are when doing a FilterAll function.
  110.  
  111. 	/***************************************
  112. 	FUNCTIONS THAT YOU ARE LIKELY TO INTERACT WITH START HERE
  113. 	****************************************/
  114.  
  115. 	// wikiFilter is a single-use filter. You'll probably call this directly only when you are testing a new filtering callback. Otherwise, try wikiFilterAll() instead.
  116. 	// You don't need to edit this function to change filter behavior. Instead, create a new CALLBACK function (see the end of this file for examples).
  117. 	// grabs the content of $title, then passes it to $callback, which returns the new content. If the new content is different from the old content, this function edits the wiki with the new content.
  118. 	function wikiFilter($title,$callback,$summary='',$callbackParams=array()){
  119. 		if (!$this->wikiConnect())
  120. 			die ("Unable to connect.");
  121. 		// $this->fetchform doesn't work for us because it strips out the content of the <textarea>. So use $this->fetch instead first:
  122. 		$this->wikiTitle = $title; // for use by callbacks in our various bots, if needed. E.g. see FindRelatedLinksBot
  123. 		if (!$this->fetch( $this->wikiServer . PREFIX . '/index.php?title=' . $title . '&action=edit' ) )
  124. 			return false;
  125.  
  126. 		// in order to save changes, you'll need to submit a few hidden vars. See http://www.mediawiki.org/wiki/Manual:Parameters_to_index.php#Parameters_that_are_needed_to_save
  127. 		// you'll need the edit token. Usually looks something like this: <input type='hidden' value="cb6843e700be730a715813304648ff20" name="wpEditToken" />
  128. 		// in later versions of MW, you might also see an edit token of "\+" (if not logged in) or a token like my example but with \ at the end. So we look for 0-9, a-z, \, and +:
  129. 		#if (1!=preg_match("|<input[^>]*value=['\"]([0-9a-z]*)['\"] name=['\"]wpEditToken['\"]|Ui",$this->results,$editToken)) // for older versions.
  130. 		if (1!=preg_match("|<input[^>]*value=['\"]([0-9a-z\\\+]*)['\"] name=['\"]wpEditToken['\"]|Ui",$this->results,$editToken))
  131. 			return false;
  132. 		$post_vars['wpEditToken'] = $editToken[1];
  133.  
  134. 		// you'll also need wpStarttime and wpEdittime
  135. 		if (1!=preg_match("|<input[^>]*value=['\"]([0-9a-z]*)['\"] name=['\"]wpStarttime['\"]|Ui",$this->results,$startTime))
  136. 			return false;
  137. 		$post_vars['wpStarttime'] = $startTime[1];
  138. 		if (1!=preg_match("|<input[^>]*value=['\"]([0-9a-z]*)['\"] name=['\"]wpEdittime['\"]|Ui",$this->results,$editTime))
  139. 			return false;
  140. 		$post_vars['wpEdittime'] = $editTime[1];
  141.  
  142. 		// a couple other vars we'll need to post.
  143. 		$post_vars['wpSummary'] = $summary; // let's leave an edit summary
  144. 		$post_vars['wpSave'] = 'Save page'; // we want to save, not preview, not see diffs.
  145.  
  146. 		// now let's grab the current content and run it through our filter
  147. 		if (1!=preg_match("|<textarea[^>]*name=['\"]wpTextbox1['\"][^>]*>(.*)</textarea>|Usi",$this->results,$content))
  148. 			return false;
  149. 		$content = htmlspecialchars_decode( $content[1] ); // turn all the &quot; back into ", else MediaWiki will turn the &quot; into &amp;quot;
  150. 		$post_vars['wpTextbox1'] = call_user_func( $callback, $content, $callbackParams );
  151. 		if (false===$post_vars['wpTextbox1'])
  152. 			die( 'Callback returns an error.' );
  153. 		if ($content == $post_vars['wpTextbox1']) // no editing necessary; our callback made no changes
  154. 			return true;
  155.  
  156. 		// all done. Let's submit the form.
  157. 		$this->maxredirs = 0; // we don't want to follow the redirect from the edit form back to the article, or else we won't be able to sniff response codes to check for success.
  158. 		if ($this->submit( $this->wikiServer . PREFIX . '/index.php?title=' . $title . '&action=submit', $post_vars ) ){
  159. 			// Now we need to check whether our edit was accepted. If it was, we'll get a 302 redirecting us to the article. If it wasn't (e.g. because of an edit conflict), we'll get a 200.
  160. 			$code = substr($this->response_code,9,3); // shorten 'HTTP/1.1 200 OK' to just '200'
  161. 			if ('200'==$code)
  162. 				return false;
  163. 			elseif ('302'==$code)
  164. 				return true;
  165. 			else
  166. 				return false; // if you get this, it's time to debug.
  167. 		}else{
  168. 			// we failed to submit the form.
  169. 			return false;
  170. 		}
  171. 	}
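	/*
The hidden-field scraping in wikiFilter() can be tried outside the class. The sketch below runs the same wpEditToken pattern against two made-up HTML fragments; the fragments and variable names are illustrative only and were not captured from a real wiki.

```php
<?php
// Same pattern wikiFilter() uses to pull the edit token out of the edit form.
$pattern = "|<input[^>]*value=['\"]([0-9a-z\\\+]*)['\"] name=['\"]wpEditToken['\"]|Ui";

// Illustrative fragments only; a real edit form carries many more fields.
$plainForm = '<input type="hidden" value="cb6843e700be730a715813304648ff20" name="wpEditToken" />';
$slashForm = '<input type="hidden" value="+\\" name="wpEditToken" />'; // the logged-out style token

$plainFound = preg_match($pattern, $plainForm, $plainToken);
$slashFound = preg_match($pattern, $slashForm, $slashToken);

echo $plainFound ? $plainToken[1] : 'no token', "\n";
echo $slashFound ? $slashToken[1] : 'no token', "\n";
```

Note that the pattern expects the value attribute to come before the name attribute. If your wiki's edit form orders them differently, preg_match() returns 0 and wikiFilter() returns false.
	*/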
  172.  
  173. 	// 	if you're doing something (like editing) that requires being logged in, start your function with a condition like this:	if ($this->wikiConnect())
  174. 	function wikiConnect(){
  175. 		$this->wikiCookies = CACHE . 'cookies_' . $this->wikiUser . '.php';
  176. 		if ($this->wikiConnected) 		// no need to repeat all this if it's already been done.
  177. 			return true;
  178. //		if (file_exists($this->wikiCookies) && (filemtime($this->wikiCookies) > (time() - COOKIETIME)) ){ // check cookie freshness
  179. //			include_once($this->wikiCookies); 	// load cookies from cache
  180. //			$this->cookies = $cookiesCache;
  181. //			$this->wikiConnected = true;
  182. //			return $this->wikiConnected; // we have the cookies, proceed with whatever you want to do.
  183. //		}
  184. 		else{
  185. 			return $this->wikiLogin(); 	// if true, we have the cookies, proceed with whatever you want to do.
  186. 		}
  187. 	}
  188.  
  189. 	// harvests all internal links from a $source article and stores them in temp directory. (If $source is Special, use SpecialFilterAll instead.)
  190. 	// on each subsequent load, picks the next link from the list and runs it through $this->wikiFilter using the supplied callback.
  191. 	// $metaReload is number of seconds it waits between edits. Just open in a window and let it run in the background. Or you could use a cron if you want, but function assumes you don't.
  192. 	// ignores category links in the $source article by default; harvests only true internal links. Set $stripCats to false to change this.
  193. 	// it will write an edit summary using $summary.
  194. 	function wikiFilterAll($source,$callback,$summary='',$callbackParams=array(),$metaReload=DELAY,$stripCats=true){
  195. 		// don't change these next four lines; other __FilterAll() methods (e.g. SpecialFilterAll()) assume they'll look just like this.
  196. 		if (!$_GET['cache'])	// the <meta> reload will append ?cache= to the current URL. If it's there, then we know we've already harvested the links. If not, start by harvesting.
  197. 			$cache = $this->wikiHarvestLinks($source,$stripCats);	// harvest the links from $source and store them to $cache
  198. 		else
  199. 			$cache = $_GET['cache'];
  200. 		// again, don't change the preceding four lines.
  201. 		$links = $this->LoadLinksCache( $cache );
  202. 		$link = $links[0];	// use the link at the top of the cache array.
  203. 		global $passv;
  204. 		$passv = $link;
  205. 		if ($this->wikiFilter($link, $callback, $summary, $callbackParams)){
  206. 			array_shift( $links );	// remove the first link from the array.
  207. 			if (0==count($links)){	// if TRUE, then we're all done.
  208. 				unlink($cache);		// delete the cache file.
  209. 				echo('All done!');
  210. 				if (!empty($_GET['nuFile'])) // used by one of my bots (the RecentChangesBot). You can delete this if you want.
  211. 					echo '<br /><br />Needs update: <a href="RecentChangesLinkBot.php?nuFile='.urlencode($_GET['nuFile']).'">'.htmlspecialchars($_GET['nuFile']).'</a>.'; // escape user-supplied input before echoing it
  212. 				die;
  213. 			}
  214. 			$this->UpdateLinksCache( $cache, $links );
  215. 			$success = true;
  216. 		}else{
  217. 			$success = false;
  218. 			$usualReload = $metaReload;
  219. 			$metaReload = '300'; // when we fail to edit successfully, make sure we wait five minutes in case we're setting off flood alarms and that's why we're failing.
  220. 		}
  221. 		$getstring = '?cache='.$cache;
  222. 		if (1<count($_GET)){	// some form-based bots may create additional GET vars that we need to preserve (besides just "cache", which we've already got)
  223. 			foreach( $_GET as $getvar => $getvalue ){
  224. 				if ($getvar != 'cache')	// already took care of cache
  225. 					$getstring .= '&amp;'.$getvar.'='.$getvalue;
  226. 			}
  227. 		}
  228. 		$out = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  229. 			<html>
  230. 			<head>
  231. 				<title>'.$callback.'</title>
  232. 				<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  233. 				<meta http-equiv="refresh" content="'.$metaReload.';url='.$getstring.'" />
  234. 				<meta name="robots" content="noindex, nofollow" />
  235. 			</head>
  236. 			<body>
  237. 				<h2>'.$callback.'</h2>
  238. 				<p>Cache: '.$cache.'</p>
  239. 				<p>Source: '.$source.'</p>
  240. 				<p>Current: <a href="' . $this->wikiServer . PREFIX . '/index.php?title='.$link.'">'.$link.'</a></p>
  241. 				<p>Remaining: '.count($links).' articles.</p>
  242.  
  243. 		';
  244. 		if ($success){
  245. 			$out .= '<p style="color:#00f;"><strong>Successfully filtered.</strong> Waiting '.$metaReload.' seconds...</p>';
  246. 			$out .= '<p>If status code below is "200," then the callback function made no changes. If status code is "302", then changes were successfully made.</p>';
  247. 		}else{
  248. 			$out .= '<p style="color:#f00;"><strong>Filtering failed.</strong> Will wait '.$metaReload.' seconds until the next attempt instead of the usual '.$usualReload.', just in case the failure happened due to moving too fast. You can obviously just hit "reload" if you want to proceed immediately.</p>';
  249. 			$out .= '<p>If status code below is "200," then we probably encountered an editing error from the wiki. Append "&showall=1" to this page\'s URL (then reload) to see more information.</p>';
  250. 		}
  251. 		$out .= $this->failureInfo();
  252. 		$out .= '
  253. 			</body>
  254. 			</html>
  255. 		';
  256. 		echo $out;
  257. 	}
  258.  
  259. 	// analogous to wikiFilterAll(), but for special pages as $source.
  260. 	function SpecialFilterAll($source,$callback,$summary='',$callbackParams=array(),$metaReload=DELAY){
  261. 		$_GET['cache'] = $this->SpecialLinksCache($source); // load up the harvested links from the cache.
  262. 		$this->wikiFilterAll($source, $callback, $summary, $callbackParams, $metaReload);
  263. 	}
  264.  
  265. 	// analogous to wikiFilterAll(), but uses a user-defined array of links to process instead of harvesting the links from an existing page in the wiki
  266. 	function ArrayFilterAll($links,$callback,$summary='',$callbackParams=array(),$metaReload=DELAY){
  267. 		$_GET['cache'] = $this->ArrayLinksCache($links);
  268. 		$this->wikiFilterAll('User-provided array', $callback, $summary, $callbackParams, $metaReload);
  269. 	}
  270.  
  271. 	// analogous to wikiFilterAll(), but harvests links from a Category. You'll need to edit mediawiki/includes/CategoryPage.php and wrap
  272. 	// the list of category links in <div class="categoryItems">...</div> for this to work for you. Sorry.
  273. 	function CategoryFilterAll($source,$callback,$summary='',$callbackParams=array(),$metaReload=DELAY){
  274. 		$_GET['cache'] = $this->CategoryLinksCache($source); // load up the harvested links from the cache.
  275. 		$this->wikiFilterAll($source, $callback, $summary, $callbackParams, $metaReload);
  276. 	}
  277.  
  278. 	/* How FilterRecentChanges works:
  279. 	* 	Checks for previously harvested links. 
  280. 	*	If there are no previously harvested links:
  281. 		* 	Checks for $file, which will contain a timestamp ($cacheTime) indicating the last time that we harvested 
  282. 			  from Special:RecentChanges. It will also contain the name of the temp file ($cache) where we stored links after our last harvest.
  283. 			  For clarity, we call the harvested links file $cache and the temp data file $file
  284. 		* 	If no $file exists, then this is probably our first use of this bot. So we grab all changes from last 7 days to start with and harvest links from them into $cache.
  285. 		* 	Harvests article titles from Special:RecentChanges, removing duplicates (i.e. articles edited more than once since last we checked).
  286. 		* 	Examines only MAIN namespace (ns=0) by default. To do all, set $namespace=FALSE. 
  287. 		* 	Caches the harvested links into $cache, just like other filtering functions do.
  288. 	* 	With harvested links: Runs them all through a callback, just like wikiFilterAll does. When it's done, it deletes the temp file with the harvested links. So on next call, there won't be
  289. 		previously harvested links, and we start all over again.
  290. 	*/
  291. 	function FilterRecentChanges($callback,$editSummary='',$params=array(),$delay=DELAY,$namespace=0){
  292. 		$ns_plug = (false===$namespace) ? 'ALL' : $namespace;
  293. 		$file = CACHE . 'RCPatrol_' . $ns_plug . '_' . $callback . '.php'; // filename where we store some data we need
  294. 		$notFirstTime = $this->RecentChangesData($file); // true if file exists, false if file is created, die if false and file cannot be created
  295. 		require( $file ); // we now have $cacheTime and $cache
  296. 		if ($notFirstTime){
  297. 			if (file_exists( $cache )){	// if TRUE, we're continuing a job (i.e. we have a previously cached link harvest to work on)
  298. 				$_GET['cache'] = $cache;
  299. 			}else{	// we need a new link harvest
  300. 				$lastTime = $cacheTime; // copy $cacheTime before we overwrite it
  301. 				$this->RecentChangesData($file,true); // update $cache to reflect our new harvest
  302. 				require( $file ); // grab our new $cache (and $cacheTime)
  303. 				$_GET['cache'] = $this->HarvestRecentChanges( $lastTime, $cache, $namespace );
  304. 			}
  305. 		}else{	// apparently this is the first time we've used this bot. On first use, let's get the last 7 days worth of data.
  306. 			$cacheTime = date("YmdHis", mktime(gmdate("H"),gmdate("i"),gmdate("s"),gmdate("m"),gmdate("d")-7,gmdate("Y")));
  307. 			$_GET['cache'] = $this->HarvestRecentChanges( $cacheTime, $cache, $namespace );
  308. 		}
  309. 		$this->wikiFilterAll('Recent Changes Patrol', $callback, $editSummary, $params, $delay);
  310. 	}
  311.  
  312. 	/***************************************
  313. 	THIS IS THE END OF FUNCTIONS THAT YOU ARE LIKELY TO INTERACT WITH
  314.  
  315. 	THE REST OF THE CLASS IS  UTILITY FUNCTIONS THAT YOU PROBABLY WON'T INTERACT WITH
  316. 	****************************************/
  317. 	// You shouldn't ever need to call this. Call wikiConnect instead if you need to do something that requires a log in.
  318. 	// in fact, you can't call it first; wikiConnect needs to define $this->wikiCookies before this will work right.
  319. 	function wikiLogin(){
  320. 		unset($this->cookies);
  321. 		$vars['wpName'] = $this->wikiUser;
  322. 		$vars['wpPassword'] = $this->wikiPass;
  323. 		$vars['wpRemember'] = 1;
  324. 		$vars['wpLoginattempt'] = "Log+in";
  325. 		$loginUrl = $this->wikiServer . PREFIX . '/index.php?title=Special:Userlogin&action=submitlogin&type=login'; // plain & here: this URL goes into an HTTP request, not into HTML
  326. 		if ($this->submit($loginUrl,$vars)){
  327. 			/* 	okay, our 4 cookies will be now be in $this->cookies as an array. They look something like this (don't try hacking my site; I changed these):
  328. 				    [wikisum__session] => gb6b4s4u6aj9prifqla73pn096
  329. 				    [wikisum_UserID] => 2
  330. 				    [wikisum_UserName] => Botusername
  331. 				    [wikisum_Token] => efd9573b9255c93bcfee1d8a990de617
  332. 				Now we need to store this information somewhere.
  333. 			*/
  334. 			if (is_array($this->cookies)){
  335. 				$cookiesCache = '<?php $cookiesCache = array( '; // yeah, I know, I could have just used "serialize". Sue me.
  336. 				foreach( $this->cookies as $name => $value )
  337. 					$cookiesCache .= "'$name' => '$value',";
  338. 				$cookiesCache .= '); ?>';
  339. 				if (file_put_contents($this->wikiCookies,$cookiesCache) )
  340. 					$this->wikiConnected = true; // we have the cookies and we've cached them successfully. Proceed.
  341. 			}
  342. 		}
  343. 		return $this->wikiConnected; // We've got 3 IFs up there, and if any one fails, this will be false.
  344. 	}
  345.  
  346. 	function failureInfo(){
  347. 		$out = '<ul>'; // start a fresh string; appending to an undefined $out raises a notice
  348. 		$out .= '<li>' . $this->response_code . '</li>';
  349. 		$out .= '</ul>';
  350. 		if ($_GET['showall'])
  351. 			$out .= $this->results;
  352. 		return $out;
  353. 	}
  354.  
  355. 	// you probably want wikiLinks(), not wikiAllLinks(). This is just here as an example.
  356. 	// dumps (via var_dump; it does not return) an array of all the links that a particular article has. Ignores links in the sidebar; only includes links in the actual article.
  357. 	// WILL include external links and links from the table of contents and from any included templates. If you don't want this, use $this->wikiLinks() instead.
  358. 	function wikiAllLinks($title){
  359. 		$this->fetchlinks( $this->wikiServer . PREFIX . '/index.php?title=' . $title . '&action=render' );
  360. 		var_dump( $this->results );
  361. 	}
  362.  
  363. 	// you probably won't ever need to call this directly. Try wikiFilterAll() instead.
  364. 	// Fetches $source; sends the results to wikiLinks(), which returns an array of all internal links (excluding category links by default).
  365. 	// Prettifies the returned array, then stores to cache. Returns the cache filename.
  366. 	// doesn't work if $source is a special page. Use HarvestSpecialLinks() instead.
  367. 	function wikiHarvestLinks($source,$stripCats=true){
  368. 		//if (!$this->wikiConnect())	// uncomment these two lines if logging in is required to READ (not edit) the page you're trying to scrape.
  369. 		//	die( "Unable to connect." );
  370. 		if (!$this->fetch( $this->wikiServer . PREFIX . '/index.php?title=' . $source . '&action=raw' ))
  371. 			return false;
  372. 		$links = $this->wikiLinks($this->results,$stripCats);
  373. 		if (!is_array($links))
  374. 			die( 'Cannot harvest links from an article that has no links.' );
  375. 		foreach( $links as $key=>$link )		
  376. 			$links[$key] = $link[1]; // remember that we're dealing with the ugly array in wikiLinks(). Let's simplify it a bit.
  377. 		$cache = TEMP . 'harvest_' . gmdate("Ymd_His") . '.php'; // "H" is the 24-hour clock, so AM and PM harvests can't share a filename
  378. 		$this->UpdateLinksCache( $cache, $links );
  379. 		return $cache; // return the file path we used
  380. 	}
  381.  
  382. 	// analogous to wikiHarvestLinks, but for special pages. Special pages don't accept the ?action=raw trick we use in wikiHarvestLinks, so we use this method instead.
  383. 	// like wikiHarvestLinks, stores the harvested links to cache and returns the cache filename. Doesn't work on all types of special pages (must use <ol class="special">...</ol>)
  384. 	function HarvestSpecialLinks($title){
  385. 		if (!$this->fetch( $this->wikiServer . PREFIX . '/index.php?title=' . $title ) )
  386. 			return false;
  387. 		// we want only the links in here:	<ol start='1' class='special'> ... </ol>
  388. 		$found = preg_match( "|<ol[^>]*class=['\"]special['\"][^>]*>(.*)</ol[^>]*>|Usi",$this->results,$specialLinks );
  389. 		if (1 != $found) // preg_match() always leaves an array in $specialLinks, so test its return value instead
  390. 			return false;
  391. 		$specialLinks = $specialLinks[0]; // the whole thing from <ol... to </ol>
  392. 		$specialLinks = $this->_striplinks($specialLinks);
  393. 		// Special pages use relative (not absolute) links. E.g. if ALTPREFIX is '/w', then you'll have things like '/w/Article_Title' in the special page's links.
  394. 		// we want to reduce that to just the title--i.e. strip off the '/w/' part.
  395. 		if (''!=ALTPREFIX){ 	$strlen = strlen(ALTPREFIX) + 1;}  		// the "+1" is for the slash after ALTPREFIX
  396. 		else{					$strlen = 1;}						 	// for the slash
  397. 		foreach( $specialLinks as $key => $link )
  398. 			$specialLinks[$key] = substr( $link, $strlen );
  399. 		// now let's prepare to write the links to a temporary location
  400. 		$cache = TEMP . 'harvest_' . gmdate("Ymd_His") . '.php'; // "H" is the 24-hour clock, so AM and PM harvests can't share a filename
  401. 		$this->UpdateLinksCache($cache,$specialLinks);
  402. 		return $cache; // we're returning the file path
  403. 	}
  404.  
  405. 	// you get the idea by now. Analogous to the last couple. Note that you'll need to edit theme files to use this.
  406. 	function HarvestCategoryLinks($title){
  407. 		if (!$this->fetch( $this->wikiServer . PREFIX . '/index.php?title=' . $title  ) )
  408. 			return false;
  409. 		// you'll need to edit mediawiki/includes/CategoryPage.php and wrap all the category links in <div class="categoryItems"> .... </div> so that we can find them.
  410. 		$found = preg_match( "|<div[^>]*class=['\"]categoryItems['\"][^>]*>(.*)</div[^>]*>|Us",$this->results,$catLinks );
  411. 		if (1 != $found) // preg_match() always leaves an array in $catLinks, so test its return value instead
  412. 			return false;
  413. 		$catLinks = $catLinks[0]; // the whole thing from <div... to </div>
  414. 		$catLinks = $this->_striplinks($catLinks);
  415. 		// Category pages use relative links. E.g. if ALTPREFIX is '/w', then you'll have things like '/w/Article_Title' in the category page's links.
  416. 		// we want to reduce that to just the title--i.e. strip off the '/w/' part.
  417. 		if (''!=ALTPREFIX){ 	$strlen = strlen(ALTPREFIX) + 1;}  		// the "+1" is for the slash after ALTPREFIX
  418. 		else{					$strlen = 1;}						 	// for the slash
  419. 		foreach( $catLinks as $key => $link )
  420. 			$catLinks[$key] = substr( $link, $strlen );
  421. 		// now let's prepare to write the links to a temporary location
  422. 		$cache = TEMP . 'harvest_' . gmdate("Ymd_His") . '.php'; // "H" is the 24-hour clock, so AM and PM harvests can't share a filename
  423. 		$this->UpdateLinksCache($cache,$catLinks);
  424. 		return $cache; // we're returning the file path
  425. 	}
  426.  
  427. 	// pass wikiLinks() raw wiki code (not HTML). It will find all the internal links and return them as a big ugly array. See the comments in the function for details.
  428. 	// does not work if $title isn't editable (e.g. special pages). Try HarvestSpecialLinks instead.
  429. 	// USED BY OTHER BOTS, SO DON'T MODIFY. (e.g. used by RecentChangesBot) (that's a note to myself; you can edit it if you want, of course)
  430. 	function wikiLinks($content,$stripCats=false){	// set $stripCats to TRUE if you don't want categories returned.
  431. 		// we need to find the following patterns:	[[article title|link text]] 		or 		[[article title]]
  432. 		preg_match_all("~(?<!\[)(?:\[{2})(?!\[)([^|\]]+)\|?([^\]]*)\]{2}(?!])~",$content,$internal,PREG_SET_ORDER);
  433. 		/* 	(?<!\[)(?:\[{2})(?!\[)		ensure we have two (and only two) [[ at the beginning
  434. 			([^|\]]+)			match anything up until a | or ]. This will grab the title that we're linking to.
  435. 			\|?				allows for a | (which separates article title from link text)
  436. 			([^\]]*)			match anything other than ] (this is the link text)
  437. 			\]{2}(?!])			ensure that we close with two (and only two) ]]
  438.  
  439. 			Suppose the article contains these links: 	blah blah [[Link 1|link text]] blah blah blah [[Link 2]] blah blah
  440. 			$internal will look something like this. Note that each array element is a separate link that we found.
  441. 			    [0] => Array
  442. 			            [0] => [[Link 1|link text]]			the link exactly as it was written in the article
  443. 			            [1] => Link 1				the title being linked to
  444. 			            [2] => link text				the link text, if any
  445. 			    [1] => Array
  446. 			            [0] => [[Link 2]]
  447. 			            [1] => Link 2
  448. 			            [2] => 					this is empty, since link 2 doesn't have link text.
  449. 		*/
  450. 		if ( (!is_array($internal)) || (0 == count($internal)) )
  451. 			return false;
  452. 		if ($stripCats){	// strip out all category links. These always start with [[Category: so they are easy to find.
  453. 			foreach($internal as $key=>$link)
  454. 				if (2==strpos($link[0],'Category:')){unset($internal[$key]);}
  455. 			if (0 < count($internal))
  456. 				$internal = array_values($internal); // renumber the keys, just for kicks.
  457. 			else
  458. 				return false;
  459. 		}
  460. 		// make sure we replace all spaces with underscores in link targets. (This is something I did for myself to make one of my bots work smoother. You can delete if you want.)
  461. 		foreach( $internal as $key=>$link ){
  462. 			$internal[$key][1] = str_replace(' ', '_', $link[1]); // don't change, used by recentchangesBot (note to self)
  463. 		}
  464. 		return $internal; // will be an array with at least one element.
  465. 	}
  466.  
  467. 	// if you wish to provide an array of (internal) links rather than harvesting links, send them here. It will cache them and return the cache path.
  468. 	// you don't really need to call this, though; it gets called by ArrayFilterAll() automatically.
  469. 	function ArrayLinksCache($links){
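  470. 		if (empty($_GET['cache'])){ // okay, this is our first load. empty() also covers the case where 'cache' was never set.
  471. 			$_GET['cache'] = TEMP . 'linkarray_' . gmdate("Ymd_His") . '.php'; // capital H = 24-hour clock, so a morning run and an afternoon run can't produce the same filename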
  472. 			if (!is_array($links))
  473. 				die ('You need to provide a valid link array.');
  474. 			$this->UpdateLinksCache( $_GET['cache'], $links );
  475. 		}
  476. 		return $_GET['cache'];
  477. 	}
  478.  
  479. 	// takes an array full of links and stores them to a flat file cache. Returns TRUE on success, dies on failure.
  480. 	function UpdateLinksCache($cache,$links){
  481. 		if (!is_array($links))
  482. 			die( 'You are trying to update the links cache, but you passed no links.' );
  483. 		$filebody = '<?php $links = array( ';	 // i know, i know, i should have used serialize()
  484. 		foreach( $links as $link )
  485. 			$filebody .= "\n\t'" . addslashes($link) . "', ";
  486. 		$filebody .= '); ?>';
  487. 		if (file_put_contents($cache,$filebody) )
  488. 			return true;
  489. 		die ( 'Unable to write to temporary directory.' );
  490. 	}
  491.  
  492. 	// loads up an array of links from a flat file cache created by UpdateLinksCache(). Returns links as an array.
  493. 	function LoadLinksCache($cache){
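  494. 		require( $cache ); // plain require: require_once would skip the file on a second call and leave $links undefined
  495. 		if (!isset($links) || !is_array($links))
  496. 			die('There should be an array called "links" in the cache, but it is not there.');
  497. 		$links = stripslashes_array( $links );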
  498. 		return $links;
  499. 	}
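	/*	EXAMPLE (a sketch, not part of BasicBot): the comment in UpdateLinksCache() mentions serialize() as the road not taken.
		Assuming you are free to change the cache format, a serialize()-based pair might look like the two functions below.
		The names saveLinksSerialized() and loadLinksSerialized() are hypothetical; nothing else in this file calls them.
		Because the cache is then read as data rather than executed with require, the addslashes()/stripslashes_array() round trip is no longer needed.

			function saveLinksSerialized($cache, $links){
				if (!is_array($links))
					return false;
				// array_values() renumbers the keys so the stored list is a plain 0..n array
				return false !== file_put_contents($cache, serialize(array_values($links)));
			}
			function loadLinksSerialized($cache){
				$raw = file_get_contents($cache);	// FALSE if the file cannot be read
				if (false === $raw)
					return false;
				$links = unserialize($raw);	// FALSE if the data is not valid serialized PHP
				return is_array($links) ? $links : false;
			}
	*/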
  500.  
  501. 	// a utility function. You probably want SpecialFilterAll, not this.
  502. 	// figures out where we should be getting our cached links from. Run this function's return value through $this->LoadLinksCache(), then the links will be returned as an array.
  503. 	function SpecialLinksCache($source){
  504. 		if (empty($_GET['cache']))	// the <meta> reload will append ?cache= to the current URL. If it's there, then we know we've already harvested the links. If not, start by harvesting.
  505. 			$cache = $this->HarvestSpecialLinks($source);	// harvest the links from $source and store them to $cache
  506. 		else
  507. 			$cache = $_GET['cache'];
  508. 		if ($cache)
  509. 			return $cache; // returns the filename we should use.
  510. 		else
  511. 			die( 'Unable to harvest links.' );
  512. 	}
  513.  
  514. 	// like SpecialLinksCache(), but for category pages. You'll need to edit includes/CategoryPage.php for this to work. Sorry.
  515. 	function CategoryLinksCache($source){
  516. 		if (empty($_GET['cache']))
  517. 			$cache = $this->HarvestCategoryLinks($source);
  518. 		else
  519. 			$cache = $_GET['cache'];
  520. 		if ($cache)
  521. 			return $cache; // returns the filename we should use.
  522. 		else
  523. 			die( 'Unable to harvest links.' );
  524. 	}
  525.  
  526. 	// returns TRUE if $filename exists (and $force is not set), FALSE if we successfully create or overwrite it, DIES if we cannot write it.
  527. 	function RecentChangesData( $filename, $force=false ){
  528. 		if (!$force && file_exists( $filename ) )
  529. 			return true;
  530. 		$cacheTime = gmdate("YmdHis"); // e.g. 20070801231902 for Aug 1, 2007, 23:19:02 (GMT)
  531. 		$cache = TEMP . 'harvest_' . gmdate("Ymd_His") . '.php'; // where we'll store harvested links
  532. 		$filebody = '<?php $cacheTime = "'.$cacheTime.'";' . "\n" . '$cache = "'.$cache.'"; ?>';
  533. 		if (file_put_contents($filename, $filebody) )
  534. 			return false;
  535. 		die ( 'Unable to store Recent Changes data' );
  536. 	}
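	/*	EXAMPLE (a sketch; the filename is hypothetical, and $myBot is the instance from the SAMPLE USAGE section at the end of this file):
		the file written by RecentChangesData() defines two variables, so a caller can read them back with require.

			$dataFile = TEMP . 'recentchanges_data.php';
			$existed = $myBot->RecentChangesData( $dataFile );	// TRUE: the file was already there. FALSE: it was just created.
			require( $dataFile );	// defines $cacheTime (a YmdHis timestamp string) and $cache (a path inside TEMP)
	*/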
  537.  
  538. 	function HarvestRecentChanges( $from, $cache, $ns=false ){
  539. 		$title = 'Special:Recentchanges&from='.$from.'&hidemyself=1&hidepatrolled=0&limit=5000';
  540. 		if (false!==$ns) // use FALSE with ===, since $ns might be 0 (for main namespace)
  541. 			$title .= '&namespace='.$ns;
  542. 		if (!$this->fetch( $this->wikiServer . PREFIX . '/index.php?title=' . $title ) )
  543. 			return false;
  544. 		// we want only the links enclosed in one of the '<ul class="special"' tags. The <ul> starts over for each date displayed in recent changes
  545. 		preg_match_all("~<ul[^>]*class=['\"][^'\"]*special[^'\"]*['\"][^>]*>(.*)</ul[^>]*>~Usi",$this->results,$linklist);
  546. 		/* gives us something like this:
  547. 			[0]
  548. 				[0] most recent day's recent changes (as an HTML list wrapped in the <ul class="special">...</ul> tags used in the preg_match_all)
  549. 				[1] previous day's recent changes
  550. 				[2] etc
  551. 			[1]	same thing as [0], but now each element lacks the <ul class="special">wrap</ul>
  552. 		*/
  553. 		if (!is_array($linklist[1]))
  554. 			return false;
  555. 		if (1 == count($linklist[1])) // we have only 1 day's data
  556. 			$links = $linklist[1][0]; // convert to string.
  557. 		else
  558. 			$links = implode( '', $linklist[1] ); // join every day's list into one string. We use $linklist[1] b/c we don't want the <ul> wrappers.
  559. 		unset( $linklist ); // done with it.
  560. 		// now the trick is extracting all the article titles from $links. Each entry in recent changes has several links: Diff, Hist, article, User, User talk.
  561. 		// Perhaps the easiest place to get the page title is from each entry's "history" link.
  562. 		preg_match_all('~<a href="'.PREFIX.'/index\.php\?title=([^&"]*)[^"]*action=history[^>]*>~',$links,$links);
  563. 		/*	About the regex:
  564. 				([^&"]*)	// ensures we grab only the title, not additional URL parameters (like &curid=)
  565. 				[^"]*		// here we allow for additional parameters between the title and the "action=history" bit.
  566. 			gives us something like this (the match stops at the end of the opening <a> tag):
  567. 			[0]
  568. 				[0]  <a href="/wiki/index.php?title=Fowler:_Habitual_voting&amp;curid=2010&amp;action=history" title="Fowler: Habitual voting">
  569. 				[1] additional links
  570. 			[1]
  571. 				[0] Fowler:_Habitual_voting
  572. 				[1] additional titles
  573. 		*/
  574. 		if (!is_array($links[1]))
  575. 			return false;
  576. 		if (0==count($links[1]))
  577. 			die ('I tried to harvest links from "Recent Changes," but it looks like nothing has been changed since last I checked.');
  578. 		$links = array_unique( $links[1] ); // we want only the titles, not the links. And we don't want duplicates.
  579. 		$links = array_reverse( $links ); // now the most recently changed article is at the end of the array. We want to process from oldest to newest.
  580. 		// if there are pages that you never want "recent changes" bots to fiddle with, add them to this array:
  581. 		$hands_off = array( 'Main_Page' );
  582. 		$links = array_diff( $links, $hands_off );
  583. 		if (0==count($links))
  584. 			die ('I tried to harvest links from "Recent Changes," but it looks like nothing has been changed since last I checked.');
  585. 		// alright, we've got our links array. Let's write it to $cache.
  586. 		$this->UpdateLinksCache( $cache, $links );
  587. 		return $cache;
  588. 	}
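	/*	EXAMPLE (a sketch): the history-link regex from HarvestRecentChanges(), run on the sample link from its comment.
		Here $prefix stands in for the PREFIX constant, and '/wiki' is an assumed value.
		preg_quote() is an optional extra that HarvestRecentChanges() does not use; it only matters if your prefix contains regex metacharacters such as "." or "~".

			$prefix = '/wiki';
			$html = '<a href="/wiki/index.php?title=Fowler:_Habitual_voting&amp;curid=2010&amp;action=history" title="Fowler: Habitual voting">hist</a>';
			preg_match_all('~<a href="'.preg_quote($prefix,'~').'/index\.php\?title=([^&"]*)[^"]*action=history[^>]*>~', $html, $found);
			// $found[1][0] now holds 'Fowler:_Habitual_voting'
	*/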
  589. }
  590. ////////////////////// END OF THE CLASS ////////////////////////
  591.  
  592.  
  593.  
  594.  
  595. // a couple utility functions to make life easier.
  596. function print_debug($v){
  597. 	echo '<pre>';
  598. 	if (is_array($v))
  599. 		print_r($v);
  600. 	elseif (is_string($v))
  601. 		print htmlspecialchars($v);
  602. 	else
  603. 		var_dump($v);
  604. 	echo '</pre>';
  605. }
  606. function inString($haystack,$needle,$insensitive=false){ // php should really have a function like this built in...
  607. 	if ($insensitive){ // case-insensitive check
  608. 		if (false!==stripos($haystack,$needle))
  609. 			return true;
  610. 	}else{
  611. 		if (false!==strpos($haystack,$needle))
  612. 			return true;
  613. 	}
  614. 	return false;
  615. }
  616. function stripslashes_array($arr){
  617. 	if (!is_array($arr))
  618. 		die ('You think you passed stripslashes_array an array, but you did not.');
  619. 	$out = array();
  620. 	foreach( $arr as $key => $el ) // guess I could have just used array_map() instead...
  621. 		$out[$key] = stripslashes( $el );
  622. 	return $out;
  623. }
  624. // Checks for wikification. Looks only for basic formatting: Headings, bold, italics, and links (if requested). TRUE if found, FALSE otherwise.
  625. function isWikified($content,$checkLinks=false){
  626. 	if (preg_match("|\n==([^=]+)==|U",$content))
  627. 		return true;
  628. 	if (preg_match("|\n===([^=]+)===|U",$content))
  629. 		return true;
  630. 	if (preg_match("|''([^']+)''|U",$content)) // will find both bold and italic
  631. 		return true;
  632. 	// some articles will contain links but no other formatting. We check for links as evidence of wikification only if requested.
  633. 	if ($checkLinks){	// THESE TWO REGULAR EXPRESSIONS ARE NOT TESTED YET
  634. 		if (preg_match("|\[{2}([^\[\]]+)\]{2}|",$content)) // internal links
  635. 			return true;
  636. 		if (preg_match("|\[http://([^\[\]]+)\]|",$content)) // external links
  637. 			return true;
  638. 	}
  639. 	return false;
  640. }
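/*	EXAMPLE (a sketch): what isWikified() does and does not catch. Note that both heading patterns require a newline before the "==",
	so a heading on the very first line of $content is not detected.

		isWikified( "Intro\n==History==\nText" );	// TRUE: a heading preceded by a newline
		isWikified( "==History==\nText" );		// FALSE: the heading is on the first line
		isWikified( "Some '''bold''' text" );		// TRUE: bold/italic markup
		isWikified( "See [[Main Page]]" );		// FALSE: links are ignored by default
		isWikified( "See [[Main Page]]", true );	// TRUE: links count when $checkLinks is set
*/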
  641. function checkBadWords($content,$badwords=false){ // returns TRUE if we detect bad words, FALSE otherwise.
  642. 	// if $content contains ANY of the strings in $bad, we return true. Searching is case-insensitive. Note that partial strings will also match.
  643. 	// if you put "hell" in your bad list, "Michelle" will set it off. To avoid this, use spaces around the word: " hell "
  644. 	// $bad can be overridden by $badwords.
  645. 	$bad = array( ' fuck ', ' shit ', ' pussy ', ' bitch ', ' asshole ', ' fucker ' );
  646. 	$bad = $badwords ? $badwords : $bad;
  647. 	foreach( $bad as $b ){
  648. 		if ( inString($content,$b,true) )
  649. 			return true;
  650. 	}
  651. 	return false;
  652. }
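/*	EXAMPLE (a sketch): the partial-match behavior of checkBadWords(), using a custom $badwords list.
	Padding a word with spaces avoids hits inside longer words, but it also misses the word next to punctuation or at the very start or end of $content.

		checkBadWords( 'I met Michelle today', array('hell') );		// TRUE: "hell" is inside "Michelle"
		checkBadWords( 'I met Michelle today', array(' hell ') );	// FALSE
		checkBadWords( 'What the hell.', array(' hell ') );		// FALSE: a period follows the word, not a space
*/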
  653.  
  654.  
  655.  
  656.  
  657. /**************************
  658. SAMPLE CALLBACK FUNCTIONS
  659. **************************/
  660. /*	callback "addTemplate" receives $content from wikiFilter(). You can set the following parameters by passing an array to $args:
  661. *	$args['template'] = '{{name of template}}';	// actually, the value can be anything you want to insert, like a category or something. It doesn't have to be a template. Templates need {{braces}}.
  662. *	$args['toBottom'] = true;		// if TRUE, the template will be inserted at the end. Otherwise, will be inserted at the beginning. 		*/
  663. function addTemplate($content,$args){
  664. 	if (!is_array($args))
  665. 		return $content; // do nothing.
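  666. 	$template = isset($args['template']) ? $args['template'] : ''; // e.g. array( 'template' => '{{name of template}}' ). Reading the key directly (instead of extract()) keeps a stray 'content' key from overwriting $content.
  667. 	if (''==$template)
  668. 		die ('You didn\'t pass enough arguments to "addTemplate()"');
  669. 	if (inString($content,$template)) // don't add the template twice
  670. 		return $content;
  671. 	if (!empty($args['toBottom'])) // a missing 'toBottom' key means "insert at the top"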
  672. 		$content .= "\n\n" . $template; // template at bottom of page
  673. 	else
  674. 		$content = $template . "\n\n" . $content; // template at top of page
  675. 	return $content;
  676. }
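/*	EXAMPLE (a sketch): addTemplate() called directly, then through wikiFilter(). '{{Stub}}' and the edit summary are placeholders,
	and $myBot is the instance from the SAMPLE USAGE section at the end of this file.

		addTemplate( 'Some text', array('template'=>'{{Stub}}', 'toBottom'=>true) );	// gives "Some text\n\n{{Stub}}"
		addTemplate( 'Some text', array('template'=>'{{Stub}}', 'toBottom'=>false) );	// gives "{{Stub}}\n\nSome text"
		$myBot->wikiFilter( 'Project:Sandbox', 'addTemplate', 'Adding a stub notice.', array('template'=>'{{Stub}}', 'toBottom'=>true) );
*/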
  677. /* 	callback addCategory is almost identical to addTemplate. Pass it an array with this arg:
  678. 	$args['cat'] = 'Category_Name';	// you don't need to put brackets around the category name. And don't precede it with "Category:" either. Just the name, please.		*/
  679. function addCategory($content,$args){
  680. 	if (!is_array($args))
  681. 		return $content; // do nothing
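  682. 	$cat = isset($args['cat']) ? $args['cat'] : ''; // read the key directly (instead of extract()) so a stray 'content' key can't overwrite $content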
  683. 	if (''==$cat)
  684. 		die( "You didn't pass valid parameters to 'addCategory()'" );
  685. 	$cat = '[[Category:'.$cat.']]';
  686. 	if (inString($content,$cat))	// don't add the category twice
  687. 		return $content;
  688. 	$content = trim($content);
  689. 	if (''==$content) // this probably means that we're adding a category to a category page without any text at the top
  690. 		$content = $cat;
  691. 	elseif (inString($content, '[[Category:'))
  692. 		$content .= " $cat"; // if there are already cats, they're probably at the end. Let's just tack one on.
  693. 	else
  694. 		$content .= "\n\n$cat"; // if there aren't already cats, let's be neat and put this on a new line at the end.
  695. 	return $content;
  696. }
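/*	EXAMPLE (a sketch): addCategory() called directly, then through wikiFilter(). 'Robots' and 'Machines' are placeholder category names,
	and $myBot is the instance from the SAMPLE USAGE section at the end of this file.

		addCategory( 'Some text', array('cat'=>'Robots') );	// gives "Some text\n\n[[Category:Robots]]"
		addCategory( 'Some text [[Category:Machines]]', array('cat'=>'Robots') );	// gives "Some text [[Category:Machines]] [[Category:Robots]]"
		addCategory( 'Some text [[Category:Robots]]', array('cat'=>'Robots') );	// returned unchanged: the category is already there
		$myBot->wikiFilter( 'Project:Sandbox', 'addCategory', 'Adding a category.', array('cat'=>'Robots') );
*/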
  697.  
  698. function addTransclude($content,$args){
  699. 	if (!is_array($args))
  700. 		return $content; // do nothing.
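  701. 	$template = isset($args['template']) ? $args['template'] : ''; // e.g. array( 'template' => '{{name of template}}' ). Reading the key directly (instead of extract()) keeps a stray 'content' key from overwriting $content.
  702. 	if (''==$template)
  703. 		die ('You didn\'t pass enough arguments to "addTransclude()"');
  704. 	if (inString($content,$template)) // don't add the template twice
  705. 		return $content;
  706. 	if (!empty($args['toBottom'])) // a missing 'toBottom' key means "insert at the top"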
  707. 		$content .= "\n\n" . $template; // template at bottom of page
  708. 	else
  709. 		$content = $template . "\n\n" . $content; // template at top of page
  710. 	return $content;
  711. }
  712.  
  713.  
  714.  
  715.  
  716.  
  717. #############################################################
  718. ##################### DEMOS #################################
  719. #############################################################
  720.  
  721. // A sample callback for use with wikiFilter() (see also the sample callbacks above). This is completely useless other than to demonstrate adding and removing categories.
  722. // It will toggle whether a page has the "robots" category:
  723. function dumbFilter($content){
  724. 	if (inString($content,'[[Category:Robots]]'))		// if the article already has the category...
  725. 		$content = str_replace( '[[Category:Robots]]', '', $content); // remove it.
  726. 	else 				// otherwise...
  727. 		$content .= ' [[Category:Robots]]';	// add it.
  728. 	return $content;	// CRUCIAL. Return the edited content back to wikiFilter().
  729. }
  730.  
  731. // another sample filter. This one adds any text we specify to the end of the article.
  732. // $params needs to be passed as an array that looks like this:		array( 'newtext' => 'Add me to the end of the article.' );
  733. function dumbFilter2($content,$params){
  734. 	extract($params); // produces $newtext
  735. 	$content .= $newtext;	// add $newtext to the end of the article
  736. 	return $content;		// CRUCIAL. Return the edited content back to wikiFilter().
  737. }
  738.  
  739. ############################################################
  740. ############################################################
  741. /* 	SAMPLE USAGE
  742.  
  743. 	Initiate our class:
  744. 		$myBot = new BasicBot();
  745.  
  746. 	DEMO 1: Filtering a single article using a callback that takes no additional parameters:
  747. 		Run the content of 'Project:Sandbox' through function dumbFilter(). Leave edit summary 'Testing a bot.'
  748. 			$myBot->wikiFilter( 'Project:Sandbox', 'dumbFilter', 'Testing a bot.' );
  749.  
  750. 	DEMO 2: Filtering a single article using a callback that does take additional parameters (must pass params as an array):
  751. 		Run the content of 'Project:Sandbox' through function dumbFilter2(). Send parameter $newtext="Wuzzup?". Leave edit summary 'Testing another bot.'
  752. 			$myBot->wikiFilter( 'Project:Sandbox', 'dumbFilter2', 'Testing another bot.', array('newtext'=>'Wuzzup?') );
  753.  
  754. 	DEMO 3: Applying a filter to a whole bunch of articles is just as easy.
  755. 		Run the content of all articles linked to by 'Project:Sandbox' through dumbFilter(), leaving edit summary 'Testing a bot on lots of pages' on each affected article:
  756. 			$myBot->wikiFilterAll( 'Project:Sandbox', 'dumbFilter', 'Testing a bot on lots of pages' );
  757.  
  758. 	DEMO 4: It's just as easy if we're using a callback that accepts parameters. Let's repeat the preceding example, but now we'll use dumbFilter2 and pass $newtext='Wuzzup?'
  759. 		$myBot->wikiFilterAll( 'Project:Sandbox', 'dumbFilter2', 'Testing a bot on lots of pages', array('newtext'=>'Wuzzup?') );
  760.  
  761. 	And that's that.
  762.  
  763. 	Note that wikiFilter() and wikiFilterAll() take a couple additional arguments if you want. Look at the function definitions and you can figure it out.
  764. 	Also note that you don't have to scrape 'Project:Sandbox' for links. Look in the class for SpecialFilterAll() and ArrayFilterAll().
  765. */
  766. ############################################################
  767. ############################################################
  768. ?>