Saturday, April 11, 2009

Google AdSense PDFs

This is about a series of PDFs that I found on Google's UK server, which are case studies from the AdSense program.
I was reading about configuring the robots.txt file from this URL.

Then began a series of URL redirections (http://google.co.in/robots.txt --> http://www.google.com/sitemaps_webmasters.xml --> http://www.google.co.uk/intl/en/adtoolkit/pdfs/pdf_sitemap.txt) until I reached a URL which contained a list of PDFs.
I was too curious to see the contents of all the PDFs in a single go. That is when I realized that I could download them all (no pun intended) really quickly if I used shell scripts to semi-automate the download process.

The process I followed:
1) Copied the contents of that txt file to a file on the system (links.txt).
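
As an aside, something like this wget call could have fetched that list directly instead of copy-pasting (the URL is the last one in the redirect chain above):

wget -O links.txt http://www.google.co.uk/intl/en/adtoolkit/pdfs/pdf_sitemap.txt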

2) I just had to insert "firefox " before each URL. This was the difficult part, since I was really awful at tweaking shell scripts to achieve that. Instead, I wrote a quick C++ program to do the same with file handling (it was guaranteed to work too).
The C++ code for the same:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main() {
    // links.txt is the original file that contained the list of URLs.
    ifstream in("links.txt");

    // link.sh is the shell script which was to automate the process.
    ofstream out("link.sh");

    string s;
    while (getline(in, s)) {
        // Prefix each URL with "firefox " so the script opens it for download.
        out << "firefox " << s << endl;
    }

    in.close();
    out.close();
    return 0;
}
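
In hindsight, the same insertion is a one-line job for sed; a minimal sketch, assuming links.txt holds one URL per line:

sed 's/^/firefox /' links.txt > link.sh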

3) I had disabled all download prompts in the Preferences tab, so Firefox would download all of these PDFs without prompting me ;). After this, I just had to run the script from the terminal and voila! All PDFs were downloaded within 1-2 minutes.
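
For reference, link.sh ended up being just one firefox command per URL, something like this (URLs illustrative, not the actual ones):

firefox http://www.google.co.uk/intl/en/adtoolkit/pdfs/case_study_1.pdf
firefox http://www.google.co.uk/intl/en/adtoolkit/pdfs/case_study_2.pdf

Running it was a single command:

sh link.sh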

4) Later, I had to move the PDFs from the default download directory to a different directory.
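
Two shell commands are enough for this kind of move; a rough sketch, with both directory paths being hypothetical:

mkdir -p ~/adsense-pdfs
mv ~/Downloads/*.pdf ~/adsense-pdfs/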

5) Finally, using wc -l (word count for lines), I found out that I had downloaded around 167 PDFs (some of them marked confidential :P). Since they are accessible for public viewing, it doesn't matter.
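
The counting itself is a one-liner; a sketch, assuming the file names from the steps above:

wc -l links.txt                    # number of URLs in the list
ls ~/adsense-pdfs/*.pdf | wc -l    # number of PDFs actually downloaded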

I did all these steps within a matter of around 4-5 minutes, which is really quick for me.
The entire process was real fun. In case you have come across such a quick tweak, do post about it.

PS:
1) Advanced shell script tweaking is a definite to-do next semester.

2) wget would have been a better option than Firefox :( ... I am on the learning curve, trying to grasp a few things.
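
A minimal sketch of that wget alternative, assuming the same links.txt:

# -i reads the URLs from a file; no browser and no prompts needed
wget -i links.txt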

4 comments:

Abhishek said...

Is the effect similar to downloading all the files in parallel, without explicitly downloading each file with a download manager? We have options like "download all links" in some download managers, but we don't have any filters to download only the PDF links, so this could be a very good improvement.

Srinivas Iyengar said...

Well, there is no parallel download in either the shell script I used or with wget.
But wget has an option to specify the format of the files to be downloaded, and using -r (recursive) we can download all the related files.
The obvious disadvantage is that not all folders have permissions that even allow access to them. So it works for our college site, but not for Google or any other sites which are smart enough :D.
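
A sketch of that wget usage (site URL hypothetical):

# -r recurses, -np stays below the starting directory, -A keeps only PDFs
wget -r -np -A pdf http://www.example.com/some/directory/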

Anonymous said...

You could have tried proz or axel. They are multithreaded command-line download managers.

Srinivas Iyengar said...

I didn't have axel on my system at that time, and I needed a quick-fix solution within 10 minutes.