Hiya, I'm self-learning C++ at the moment and I was wondering: is it possible to create a program that can scan a website's details (text, paragraphs, sentences, etc.) and look for updates to that website?
Not in C++ I believe. I took a C++ programming class a couple years ago, but the programs you wrote were entirely DOS operated and generally for simple tasks like math, and experimenting with pathways and file creation/storage. You'd need something more complicated like Java Script, which is what I think my friend uses (He's finishing a bachelor's degree in computer engineering.) As far as finding updates though.... what's wrong with the refresh button?
My knowledge of C++ is not the best, but I'm sure you can do that. As to the difficulty of such a venture, that's a different matter.
QC: You'd need something more complicated like Java Script, which is what I think my friend uses (He's finishing a bachelor's degree in computer engineering.)
I think you mean Java, which can also do it and, I think, is a bit simpler to do than it is in C++.
Post edited February 09, 2012 by adambiser
You can do this in basically ANY language. As for which is most appropriate, it probably depends on the format you want. The simplest solution I can think of is a bash script which periodically gets the HTML and uses diff to display the changes. This is probably not as friendly as you want.

The most elegant way, I think, would be to use a server-side script (perhaps PHP) to query the site, get the diff and then render its own version with the changes highlighted somehow using HTML/CSS. You can then interact with it using any browser.

Edit: Rereading, sounds like you want to learn C++ rather than have the function, in which case yes absolutely C++ can do it. You will need to look up 'sockets' for the OS you are using for the network code. You may want to look into a library for HTML/XML interpreting if you don't want to write this bit yourself.
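
For anyone who wants to see what the "get the diff" step could look like in C++ itself, here is a minimal sketch. It is not what diff actually does: it only reports lines of the new snapshot that never appeared in the old one, and it ignores line order and moved lines. The function names (split_lines, added_lines) are made up for this example, and the two snapshots are assumed to be already downloaded into strings.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a page snapshot into lines.
std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Return the lines of `newer` that do not occur anywhere in `older`,
// in the order they appear in `newer`. Cruder than diff(1): line order
// and moved lines are ignored, but it is enough to flag new text.
std::vector<std::string> added_lines(const std::string& older,
                                     const std::string& newer) {
    const std::vector<std::string> old_lines = split_lines(older);
    const std::set<std::string> seen(old_lines.begin(), old_lines.end());

    std::vector<std::string> added;
    for (const std::string& line : split_lines(newer)) {
        if (seen.count(line) == 0) {
            added.push_back(line);
        }
    }
    return added;
}
```

Calling added_lines(old_page, new_page) gives the new lines; swapping the arguments gives the lines that disappeared.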
Post edited February 09, 2012 by _Bruce_
In short, yes.

There are probably libraries out there that allow you to access web pages, and even if there weren't, nothing would prevent you from making one yourself on top of the TCP/IP networking layer.

However, if you're a beginner who isn't familiar with sockets or the HTTP protocol, and you don't have a load of spare time, you probably don't want to go there [designing your own library].

As for third-party libraries, most third-party C++ libraries that don't come from official sources (big companies or well-recognized open-source communities) tend to be not very usable and bug-prone (C++ is just one of those languages where you need to know what you are doing and have a near-spotless programming methodology to get things right).

My recommendation: try Python. It comes with a very robust standard library.

http://docs.python.org/library/index.html

I believe item 19 of the library will have what you are looking for.
Post edited February 09, 2012 by Magnitus
Sure, you can create a hash of each page, or, if you only want to check for changes to the content, use a screen scraper and hash the result (which in practice may very well amount to the same thing, depending on how good the screen scraper is).
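
A rough sketch of the hashing idea, assuming the page text has already been fetched into a std::string. The hash used here is 64-bit FNV-1a, which is my own pick for the example because it gives the same value on every run and platform (std::hash makes no such promise), so the fingerprint can be saved to disk between checks. It is not a cryptographic hash. The names fnv1a64 and page_changed are hypothetical.

```cpp
#include <cstdint>
#include <string>

// 64-bit FNV-1a: simple and deterministic across runs and platforms.
// Not cryptographic; used here only as a cheap page fingerprint.
std::uint64_t fnv1a64(const std::string& data) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char byte : data) {
        hash ^= byte;
        hash *= 1099511628211ULL;  // FNV prime
    }
    return hash;
}

// Compare the stored fingerprint of the last snapshot with the current page.
bool page_changed(std::uint64_t previous_hash, const std::string& current_page) {
    return fnv1a64(current_page) != previous_hash;
}
```

The trade-off is that a hash only tells you that something changed, not what changed.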

BTW, doing all of that in C++ is going to be a real pain as a beginner; even opening a socket is hard in C++. You might consider a language that's a little easier for that kind of thing if it's an option (Java is a good choice, and C# should be a good choice as well).
Post edited February 09, 2012 by orcishgamer
adambiser: I think you mean Java, which can also do it and, I think, is a bit simpler to do than it is in C++.
Crap, I thought I finally stopped getting mixed up between the two.
Magnitus: However, if you're a beginner and not familiar with sockets or the HTTP protocol, you probably don't want to go there [designing your own library].
In most cases the finer points of HTTP shouldn't be needed; a simple request using sockets is all that is required. If you want to support sites that need cookies, JavaScript, Flash or other strange things, you are in for a world of hurt any way you look at it.
_Bruce_: In most cases the finer points of HTTP shouldn't be needed; a simple request using sockets is all that is required. If you want to support sites that need cookies, JavaScript, Flash or other strange things, you are in for a world of hurt any way you look at it.
He'd also need to know how to create a valid HTTP request as well as parse a valid (and possibly invalid) HTTP reply (how the header and body are structured).

If he plans on accessing untrusted servers, he should be very meticulous about how he handles that reply.

Concerning JavaScript and cookies, I'd venture to say that the vast majority of web sites use them nowadays.
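
To give an idea of the request and reply structure mentioned above, here is a minimal sketch in C++. It leaves out the socket code entirely and only deals with the text that goes over the wire. It assumes a plain HTTP/1.x exchange with no chunked transfer encoding, compression or redirects, and the names (build_get_request, HttpReply, parse_reply) are invented for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Build a minimal HTTP/1.1 GET request. "Connection: close" asks the server
// to close the connection after the reply, so a client can read until EOF.
std::string build_get_request(const std::string& host, const std::string& path) {
    return "GET " + path + " HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n"
           "\r\n";
}

struct HttpReply {
    bool ok = false;      // false if the reply did not look like HTTP/1.x
    int status = 0;       // e.g. 200 or 404
    std::string headers;  // raw header lines, without the status line
    std::string body;     // everything after the blank line
};

// Split a raw reply into status code, headers and body. Anything that does
// not match the expected shape is rejected rather than guessed at.
HttpReply parse_reply(const std::string& raw) {
    HttpReply reply;
    const std::size_t header_end = raw.find("\r\n\r\n");
    if (header_end == std::string::npos) return reply;

    const std::size_t line_end = raw.find("\r\n");
    const std::string status_line = raw.substr(0, line_end);

    // Expected shape: "HTTP/1.x NNN reason"
    if (status_line.size() < 12) return reply;
    if (status_line.compare(0, 7, "HTTP/1.") != 0) return reply;
    if (status_line[8] != ' ') return reply;
    for (std::size_t i = 9; i < 12; ++i) {
        if (!std::isdigit(static_cast<unsigned char>(status_line[i]))) return reply;
    }

    reply.status = (status_line[9] - '0') * 100 +
                   (status_line[10] - '0') * 10 +
                   (status_line[11] - '0');
    if (header_end > line_end) {
        reply.headers = raw.substr(line_end + 2, header_end - (line_end + 2));
    }
    reply.body = raw.substr(header_end + 4);
    reply.ok = true;
    return reply;
}
```

Using std::string for everything avoids fixed-size buffers, which is where replies from untrusted servers usually cause trouble.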
Post edited February 09, 2012 by Magnitus
Magnitus: He'd also need to know how to create a valid HTTP request as well as parse a valid HTTP reply (how the header and body are structured).

Concerning JavaScript and cookies, I'd venture to say that the vast majority of web sites use them nowadays.
The query is very simple; it would be easier to just learn it and do it than to find and use a library (which also means you can pick any language).

If an untrusted reply is an issue because of poor memory management, you are far more likely to just crash than anything else, and that would need to be fixed anyway.

JavaScript and cookies are /used/ everywhere, but not /required/ everywhere; there is a world of difference.
Post edited February 09, 2012 by _Bruce_
_Bruce_: The query is very simple; it would be easier to just learn it and do it than to find and use a library (which also means you can pick any language).

JavaScript and cookies are /used/ everywhere, but not /required/ everywhere; there is a world of difference.
They are required if you want to execute the web page the way it was intended to be, but not if you just want to parse it (though he might have to make a separate request for the JS files if he wants to parse them as well).

For a purely pedagogical exercise on a trusted server that only supports the simplest of cases, it sounds harmless enough, but at his level, he might still be in over his head (just dealing with the sockets alone).

Hard to say.

It really depends on how much time and energy he wants to invest in it.
Post edited February 09, 2012 by Magnitus
Magnitus: It really depends on how much time and energy he wants to invest in it.
If the intent is to actually learn C++, the task at hand is the least of the problems.
_Bruce_: If the intent is to actually learn C++, the task at hand is the least of the problems.
Yes and no.

In this case, I think he might spend more time learning the specifics of the task at hand than he would spend learning C++ in general.
If you must do something like that in C++, and you are only concerned with text that is directly visible in the page, you can use libcurl and htmlcxx to do the heavy lifting. You call functions in libcurl to retrieve the contents of the page, and functions in htmlcxx to parse the page into its elements. Both libraries are well documented, though all in all I would suggest that this is not exactly a beginner's project.

CURL and libcurl: http://www.haxx.se
htmlcxx: http://htmlcxx.sourceforge.net

But if you are trying to learn Visual C++, you have to do things entirely differently. I will leave them for another post, but I will gently suggest that Visual C++ is not actually C++ and will not teach you how to write good, standard, or portable C++.
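
To show roughly what "only the text that is directly visible" means, here is a very naive stand-in written with the standard library only. It is not htmlcxx and does not use its API: it simply drops everything between '<' and '>' and collapses whitespace. It does not decode entities, does not skip the contents of script or style elements or comments, and gets confused by a '>' inside an attribute value, all of which a real parser handles. The function name strip_tags is made up for this sketch.

```cpp
#include <cctype>
#include <string>

// Naive visible-text extraction: drop anything between '<' and '>' and
// collapse runs of whitespace into a single space. Every tag is treated as
// a word break, so "<b>bo</b>ld" comes out as "bo ld".
std::string strip_tags(const std::string& html) {
    std::string text;
    bool in_tag = false;
    bool pending_space = false;

    for (char c : html) {
        if (c == '<') {
            in_tag = true;
            pending_space = true;
            continue;
        }
        if (in_tag) {
            if (c == '>') in_tag = false;
            continue;
        }
        if (std::isspace(static_cast<unsigned char>(c))) {
            pending_space = true;
            continue;
        }
        if (pending_space && !text.empty()) text += ' ';
        pending_space = false;
        text += c;
    }
    return text;
}
```

Comparing the stripped text of two snapshots ignores changes that only touch the markup, such as a renamed CSS class.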
orcishgamer: Sure, you can create a hash of each page, or, if you only want to check for changes to the content, use a screen scraper and hash the result (which in practice may very well amount to the same thing, depending on how good the screen scraper is).

BTW, doing all of that in C++ is going to be a real pain as a beginner; even opening a socket is hard in C++. You might consider a language that's a little easier for that kind of thing if it's an option (Java is a good choice, and C# should be a good choice as well).
I don't see why it should be that much of a pain in C++. You can use libcurl to fetch the page, create a base hash as you suggest, download the new page using libcurl, hash that and compare the two hashes without ever having to parse the markup.

Simples.
Post edited February 09, 2012 by jamyskis