Home

Short Description

Abspieler lets you retrieve data from foreign web sites in an easy but flexible way, even if the data is spread across multiple documents and even if those documents are built dynamically. Abspieler supports HTTP as well as HTTPS, GET as well as POST (also with parameters), redirects (including META refreshes), and referrers. It also manages cookies automatically.

Download latest version (version 0.0.174; 10 Dec 2007)

Prerequisites

To use Abspieler, you need the following libraries:

The easiest way to install them is to download the JAR files from the given web sites and put them into the directory "lib/ext" of your Java installation.

More resources

Background

Abspieler uses the same elements as a normal browser:

Simple usage

Abspieler is a powerful and flexible tool, but it is also easy for beginners to use.

  1. Import Abspieler:

    import de.intags.abspieler.*;

  2. Open the browser:

    Browser browser = new Browser();

  3. Open the first window. You can open as many windows as you want, which is why each one needs a name:

    Window window = browser.newWindow("MyBrowserWindow");

  4. Create the first document and point it at a URL:

    Document document = window.newDocument("http://www.intags.de/abspieler/demo/demo1.html");

  5. Retrieve the document:

    document.retrieve();

  6. Use the data:

    System.out.println(document.getResponse().getContent());

Note: Most of the methods can throw an AbspielerException, so your own methods have to either catch it or declare it, too.

Example:

package YourOwnPackage;

import de.intags.abspieler.*;

public class Demo1 {
	public static void main(String[] args) throws AbspielerException {
		Browser browser = new Browser();
		Window window = browser.newWindow("MyBrowserWindow");
		Document document = window.newDocument("http://www.intags.de/abspieler/demo/demo1.html");
		document.retrieve();
		System.out.println(document.getResponse().getContent());
	}
}	

Expectations

If you want to retrieve data from a website, you expect this data to be in a certain context. The data must have a special format, must be embedded in a website with a certain title or content, etc. So only web pages that match your expectations can contain the data you're looking for.

For example, if you're looking for stock quotes, you might expect them to be on a page whose title contains the word "quote", and each quote to be in a certain format, perhaps a table row with the ISIN on the left and the price on the right, as on this page: http://www.intags.de/abspieler/demo/demo2.html. So you are looking for HTML code similar to:

		<tr>
			<td>DE12345678</td>
			<td>Expensive Inc.</td>
			<td>538.34</td>
		</tr>

Pages that don't match those expectations are not interesting to you, and normally you don't want to analyze them. But if a page matches, you can retrieve the data very simply using Abspieler's methods.
If you also want to analyze pages that don't match your expectations, you can do so by analyzing the complete response content (see above).
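
As a minimal sketch of analyzing the complete response content yourself, the following uses only java.util.regex from the standard library, not Abspieler. The class name TitleCheckSketch, the method extractTitle and the sample content string are made up for illustration; in a real program the string would be whatever document.getResponse().getContent() returned.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleCheckSketch {
    // CASE_INSENSITIVE also accepts <TITLE>; DOTALL lets "." match line breaks
    // in case the title is spread over several lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text between the title tags, or null if there is no title. */
    public static String extractTitle(String content) {
        Matcher matcher = TITLE.matcher(content);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical page content, standing in for a retrieved response.
        String content = "<html><head><title>Stock quotes</title></head>"
                + "<body>No table here.</body></html>";

        String title = extractTitle(content);
        boolean mentionsQuotes = title != null && title.toLowerCase().contains("quote");

        System.out.println("Title: " + title);
        System.out.println("Mentions quotes: " + mentionsQuotes);
    }
}
```

Doing the check by hand like this is only worthwhile for pages that fall outside your expectations; for pages that match, the expectation mechanism does the same work for you.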

  1. Prepare the application as before:

    Browser browser = new Browser();
    Window window = browser.newWindow("MyBrowserWindow");
    Document document = window.newDocument("http://www.intags.de/abspieler/demo/demo2.html");

  2. Define your expectations. You don't have to use all of them, only the ones you really need:

    Expectation expectation_quotes = new Expectation("QuotesExpectation");

    // Make sure that there was no error while retrieving this page.
    expectation_quotes.addPossibleStatus(200);

    // Make sure that this page has the correct content type. You can check for any header you like.
    expectation_quotes.addNeededHeader("Content-type", "text/html");

    // Only pages with the word "quote" in the title are relevant.
    // The first argument is the name you will use later to get the data.
    // The second argument has to be a regular expression that includes all HTML tags.
    expectation_quotes.addNeededContent("title", "<title>.*quote.*</title>");

    If you don't know how to use regular expressions, read the Sun tutorial on regular expressions.

  3. Define what you are looking for. This works much like the previous step, but if you take a closer look at the regular expression, you will notice that it contains parentheses. Parentheses have a special meaning in regular expressions: they form a group, and the text matched by each group can be retrieved separately. See the next steps for details.

    expectation_quotes.addNeededContent("quotes",
            "<tr>[^<]*<td>([^<]*)</td>[^<]*<td>([^<]*)</td>[^<]*<td>([^<]*)</td>[^<]*</tr>");

  4. Add the expectation to your document definition:

    document.addExpectation(expectation_quotes);

  5. Retrieve the document as before:

    document.retrieve();

  6. Check whether the document is what you expected. To do this, check whether the category of the document is the same as your expectation. If the document does not conform to any of your expectations, the category is null.

    CategorizedResponse catResponse = document.getCategorizedResponse();
    if (catResponse.getExpectation() != null &&
            catResponse.getExpectation().getExpectationName().equals("QuotesExpectation")) {

  7. Process all quotes using a loop:

    for (int i = 0; i < catResponse.getMatchCount("quotes"); i++) {

  8. Get all items of this match. Group 1 (the first pair of parentheses) is represented by item 1, group 2 by item 2, and so on. Item 0 represents the complete matched string.

    System.out.println("FOUND:");
    System.out.println("\tComplete String: " + catResponse.getMatch("quotes", 0, i));
    System.out.println("\tISIN:" + catResponse.getMatch("quotes", 1, i));
    System.out.println("\tName:" + catResponse.getMatch("quotes", 2, i));
    System.out.println("\tPrice:" + catResponse.getMatch("quotes", 3, i));
    System.out.println();

Example:

package YourOwnPackage;

import de.intags.abspieler.*;

public class Demo2 {
	public static void main(String[] args) throws AbspielerException {
		Browser browser = new Browser();
		Window window = browser.newWindow("MyBrowserWindow");
		Document document = window.newDocument("http://www.intags.de/abspieler/demo/demo2.html");
		
		Expectation expectation_quotes = new Expectation("QuotesExpectation");
		
		// Make sure that there was no error while retrieving this page.
		expectation_quotes.addPossibleStatus(200);
		
		// Make sure that this page has the correct content type. You can check for any header you like.
		expectation_quotes.addNeededHeader("Content-type", "text/html");
		
		// Only pages with the word "quote" in the title are relevant.
		// The first argument is the name you will use later to get the data.
		// The second argument has to be a regular expression that includes all HTML tags.
		expectation_quotes.addNeededContent("title", "<title>.*quote.*</title>");
	
		expectation_quotes.addNeededContent("quotes",
			"<tr>[^<]*<td>([^<]*)</td>[^<]*<td>([^<]*)</td>[^<]*<td>([^<]*)</td>[^<]*</tr>");
			
		document.addExpectation(expectation_quotes);
		
		document.retrieve();
		
		CategorizedResponse catResponse = document.getCategorizedResponse();
		if (catResponse.getExpectation() != null &&
				catResponse.getExpectation().getExpectationName().equals("QuotesExpectation")) {		
    
			for (int i = 0; i < catResponse.getMatchCount("quotes"); i ++) {
				System.out.println("FOUND:");
				System.out.println("\tComplete String: " + catResponse.getMatch("quotes", 0, i));
				System.out.println("\tISIN:" + catResponse.getMatch("quotes", 1, i));
				System.out.println("\tName:" + catResponse.getMatch("quotes", 2, i));
				System.out.println("\tPrice:" + catResponse.getMatch("quotes", 3, i));
				System.out.println();
			}
		}
	}
}
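
To see how the groups in the "quotes" expression line up with the items 0 to 3, here is a small sketch that applies the same regular expression with plain java.util.regex. It is only an illustration of the pattern itself: whether Abspieler evaluates expressions this way internally is an assumption, and the class name QuoteRowSketch, the method extractRows and the sample HTML string are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteRowSketch {
    // Same expression as the "quotes" expectation: three groups, one per table cell.
    private static final Pattern ROW = Pattern.compile(
            "<tr>[^<]*<td>([^<]*)</td>[^<]*<td>([^<]*)</td>[^<]*<td>([^<]*)</td>[^<]*</tr>");

    /** Returns one array per matching row: {complete match, ISIN, name, price}. */
    public static List<String[]> extractRows(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher matcher = ROW.matcher(html);
        while (matcher.find()) {
            // group(0) is the complete match, group(1)..group(3) are the parentheses.
            rows.add(new String[] {
                matcher.group(0), matcher.group(1), matcher.group(2), matcher.group(3)
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical input, modelled on the table row shown earlier.
        String html = "<table>\n"
                + "\t<tr>\n"
                + "\t\t<td>DE12345678</td>\n"
                + "\t\t<td>Expensive Inc.</td>\n"
                + "\t\t<td>538.34</td>\n"
                + "\t</tr>\n"
                + "</table>";

        for (String[] row : extractRows(html)) {
            System.out.println("ISIN: " + row[1] + ", Name: " + row[2] + ", Price: " + row[3]);
        }
    }
}
```

Note that [^<]* also matches line breaks and tabs, which is why the expression still works when every cell is on its own line.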

More examples

Coming soon...

10 Dec 2007