Why content “grabbing” is different from normal “browsing”
Some readers defended SPH on the grounds that its employee or bot might simply have been “browsing” our site for “research” purposes, and blamed us for hanging him/her out to dry.
However, content “grabbing” is not equivalent to innocuous “browsing”, which is why we raised the alarm on the advice of our system administrator.
Below are two snapshots of our log that illustrate the difference between the two:
#1 Snapshot from the incident:

As the snapshot shows, the same IP address, 203.116.231.234, which was traced back to SPH, was logged making near-simultaneous connections to the server at VERY SHORT intervals, so it appears in the log repeatedly, one entry immediately after another.
This is the characteristic of a web grabber kind of software (it may also be sort of a SYNC attack but a SYNC attack would be grabbing the same content instead of multiple), certainly not any browser’s characteristic.
#2 Snapshot taken on 6 November 2009, 9pm during “normal” browsing:

A normal browser reading the site would also show up under the same IP address, but not as repetitively or as closely spaced in time as in snapshot #1 above.
The reader’s IP address is still logged, but the interval between connections is greater and the entries do not follow one after another.
In the highlighted example, 202.156.13.246 and 202.156.12.228 are accessing the site ‘normally’.
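The distinction between the two snapshots can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log entries are made-up (IP, timestamp) pairs using reserved documentation addresses, not our actual log, and the names `median_gap_by_ip`, `looks_like_grabber`, `max_gap` and `min_requests` and their thresholds are hypothetical choices, not a standard.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_gap_by_ip(entries):
    """Return {ip: median seconds between consecutive requests} for IPs with 2+ requests."""
    times = defaultdict(list)
    for ip, stamp in entries:
        times[ip].append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
    gaps = {}
    for ip, stamps in times.items():
        stamps.sort()
        deltas = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if deltas:
            gaps[ip] = median(deltas)
    return gaps


def looks_like_grabber(entries, max_gap=1.0, min_requests=5):
    """List IPs that made many requests with very short gaps between them.

    Both thresholds are illustrative assumptions, not established cut-offs.
    """
    counts = defaultdict(int)
    for ip, _ in entries:
        counts[ip] += 1
    gaps = median_gap_by_ip(entries)
    return sorted(
        ip for ip, gap in gaps.items()
        if gap <= max_gap and counts[ip] >= min_requests
    )


# Made-up sample data: one address firing requests back to back,
# another reading pages with minutes between requests.
SAMPLE = [
    ("192.0.2.10", "2000-01-01 21:00:00"),
    ("192.0.2.10", "2000-01-01 21:00:00"),
    ("192.0.2.10", "2000-01-01 21:00:00"),
    ("192.0.2.10", "2000-01-01 21:00:01"),
    ("192.0.2.10", "2000-01-01 21:00:01"),
    ("192.0.2.10", "2000-01-01 21:00:01"),
    ("198.51.100.7", "2000-01-01 21:00:05"),
    ("198.51.100.7", "2000-01-01 21:01:40"),
    ("198.51.100.7", "2000-01-01 21:04:10"),
]

print(looks_like_grabber(SAMPLE))
```

A short median gap on its own proves nothing about motive; it only separates the back-to-back pattern of snapshot #1 from the spaced-out pattern of snapshot #2.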
Since Mr Pereira has admitted that SPH employees were found to be visiting TR during the time period when the “grabbing” incident shown in snapshot #1 is alleged to have taken place, he should be able to answer the three most important questions we have been asking all along:
1. Who was the employee “grabbing” content from TR during the stated time period?
2. Was he/she using web grabber software to do so?
3. What were his/her motives for “grabbing” our site?
All we are asking for is an explanation of what really happened. We will close the case once we get the answers we are waiting for.
12 Responses to “Why content “grabbing” is different from normal “browsing””
@admin
am no IT wizard, still trying to digest the technicality. so ..just for clarity-sake. here, u said:
“This is the characteristic of a web grabber kind of software (it may also be sort of a SYNC attack but a SYNC attack would be grabbing the same content instead of multiple), certainly not any browser’s characteristic.”
..and the other article, u said:
“Without considering which idiot in the world would want to spoof ONLY ONE IP address to grab TR’s site and incriminate SPH, let’s get down to the technical aspects of IP spoofing.”
with the above, r u implying the following?
it’s possible that an idiot so stoopidly spoof only one IP address and launch a variant of SYNC attack (variant: grab multiple address, instead of just one).
deoxin, IP spoofing is not technically possible in this incident; the IP address grabbing the site is NOT spoofed.
I have written some technical details explaining why it wasn’t a spoofed IP; they are available in another thread on the same matter.
@Sinkapore
okay! i think ..i know what u said there now.
u r saying that .. the machine with SPH’s IP-address made multiple connections almost simultaneously to TR’s many-links.
This is plausible only if it’s done by a “grabber program” that retrieves the many-links (to connect to) from TR’s website itself ..which implies that the machine DID receive replies from TR’s website.
If the machine’s IP was spoofed, the machine wouldn’t have gotten the replies in the first place (and hence, cannot determine the many-links to connect to, especially the links to the old articles).
Correct? -)
Stoopid SPH!
Yes deoxin, that’s EXACTLY what I meant to say. Thank you for rephrasing it in such a clear manner.
One point needs clarification though: “concurrently” would be a better word than “simultaneously”, since Apache’s (the web server’s) keepalive feature would have allowed that. So a single IP address can technically make hundreds of connections to the web server, requesting information within a VERY VERY VERY short period of time, measured in nanoseconds.
Picture yourself owning a store that can only allow 100 customers onto the premises at a time. Once the first 100 customers have entered, other customers are ‘locked’ out and put into a queue, which slows down their access because they have to wait.
Hopefully readers will better understand.
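The store analogy can be put into a rough Python sketch. It assumes a fixed number of slots (`capacity`), a fixed time to serve each request (`service_time`) and first-come-first-served queuing; these names and the simplified model are hypothetical and are not how Apache itself is implemented.

```python
import heapq


def wait_times(arrivals, service_time, capacity):
    """Seconds each request waits for a free slot, in order of arrival.

    At most `capacity` requests are served at once; later arrivals queue
    until the earliest busy slot becomes free.
    """
    busy_until = []  # min-heap of finish times for the occupied slots
    waits = []
    for t in sorted(arrivals):
        if len(busy_until) < capacity:
            start = t  # a slot has never been used yet, so serve immediately
        else:
            earliest = heapq.heappop(busy_until)
            start = max(t, earliest)  # wait only if that slot is still busy
        heapq.heappush(busy_until, start + service_time)
        waits.append(start - t)
    return waits


# Two slots, one second per request: the third and fourth arrivals queue.
print(wait_times([0.0, 0.0, 0.0, 0.5], service_time=1.0, capacity=2))
```

With the store’s limit of 100 in place of 2, the picture is the same: whoever fills the slots first is served at once, and everyone arriving after that waits.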
Nope, there’s a way to build a map of the site without the site noticing. It’s possible with automatic scripts through many proxies over a few days.
But yah, who in the right frame of mind want to impersonate a grabber? Grabber they grab, so what?
My background is in IT security and I have been following this incident closely. One thing for SURE is that the IP address shown is indeed INVOLVED in CONTENT GRABBING. No other explanation can deny this fact.
On the second issue, whether IP spoofing is involved or NOT, this can be technically traced. If SPH wants to prove that its IP address was spoofed by others (meaning SPH is NOT guilty), it can do so with the help of IT security experts. But if SPH can ONLY DENY without proper facts and proof, I think it is NOT being professional.
The ball is in SPH’s court. So far, it has been handled in a TOTALLY AMATEUR manner.
@ObserverOne, there was NEVER any doubt in my mind that the IP address IS GUILTY of grabbing/ripping TR’s site after seeing the server logs handed to me by TR’s system administrator, although I am not in the IT security line; I just like to ‘play’ with servers.
On the second issue, I agree that it can be traced, and quite easily at that. Assuming that the IP address was indeed spoofed, TR’s servers on the date and time mentioned would have sent hundreds of responses, syncookies and ACKs to the REAL IP address. That would, ironically, also create a flurry of connections to SPH’s servers.
Some readers just don’t understand that although it is perfectly legal for a person or entity to grab a site, the fact that it was SPH makes a difference. One would want to question and know WHY SPH, a corporate media GIANT, would do that. It can’t be for offline browsing, can it?
Sinkapore -> you deride fellow readers as not understanding the case, where the real matter is one of differing values. Note that you’ve agreed on the legality of a person/entity “grab”-ing a site (though i’d point out that “grab” is an emotive word, technically fair & consistent, but otherwise aimed at swaying an audience used to standard English). Many would not see any problem with SPH or the CIA or Al Qaeda choosing to collect TR webpages. Given TR’s earlier DDOS complaint, it would seem fair to believe that the reliability of TR’s webhosting was in question, and that offline archival is more reliable than TR’s webhost services. The underlying value difference is in whether we the readers impute any ill motives to SPH or any other online entity. To TR supporters, all actions from others are fraught with suspicions of ill motives. To a party without interest in TR or SPH, this dispute is merely hot air.
This incident shows me more about TR and its internal processes than anything about SPH. TR basically had a knee-jerk reaction, initially imputing a continuation of DDOS, followed by tabloid-style headlines, followed by a backdown to a minor, toned-down, challenge to SPH to do internal investigations on specious no-damage grounds, followed by an attempt to justify the moral high ground of its series of actions. I like the transparency of the series of events, but it doesn’t give me much confidence in TR management that has appeared to hold only its internal views, which appears to be overly critical and cynical about all else. And hasn’t seemed to be that open to robust competing views to its own. Greater diversity and robustness, with a cohesive and sensitive viewpoint would generate TR more value.
dogbert, you keep playing like an old record the aim of which is to ridicule TR’s admin by pretending to reply to my postings.
Well nice try! Ain’t going to work cause I find it tiring to reply to a broken down record, repeating itself over and over and over and over and over again.
So I will stop responding to you UNLESS you have a fresh set of arguments to put forth.
Hi Sinkapore,
I read this article with interest because I am a system admin myself. I agree that the screenshot likely shows a machine grabbing the web pages from TR. This is due to the absence of the requests that usually make up the rest of an HTML page, such as those for images and JavaScript files.
However, I cannot conclude beyond doubt that there were concurrent connections to the TR website. This is because some important details are missing from the cropped screenshot, details needed to establish that SPH was indeed opening multiple concurrent connections spaced within nanoseconds of each other. Would TR be willing to post the full screenshot in #1 instead of the cropped one?
Here is the reason. What your cropped screenshot shows is only the last request fulfilled by each Apache process. Apache typically operates on a pre-fork model, meaning there are a number of spare Apache child processes ready to handle any incoming request. If the number of incoming requests is low, any of the spare Apache child processes can service a request. Hence, if an Apache process is idle and waiting for another request, it will still show its last request in the Apache status (it will not be blanked out).
The only way to see whether all your Apache processes are currently busy responding to requests is to look at some additional details at the top of your cropped screenshot. These show the state of each Apache child process, that is, whether it is idle or servicing a request. In addition, to the left of your cropped screenshot, there is a column of data called ‘SS’, which is the number of seconds since the last request. The numbers in this column need to be all very close to each other (within a range of 1-2) to imply that all these requests came in at around the same time. If not, you may have just jumped to the wrong conclusion.
Regards.
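The two checks IronMan describes can be sketched roughly in Python. This is a hedged illustration only: the function names, the list of asset suffixes and the default `spread` of 2 seconds (taken from the “range of 1-2” in the comment) are assumptions, and the inputs are meant to be values read off the status page by hand, since the full screenshot has not been posted.

```python
# Hypothetical list of suffixes that a browser would normally also fetch.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".ico")


def asset_share(paths):
    """Fraction of requested paths that are page assets (images, CSS, JavaScript)."""
    if not paths:
        return 0.0
    assets = sum(
        1 for p in paths
        if p.lower().split("?", 1)[0].endswith(ASSET_SUFFIXES)  # ignore query strings
    )
    return assets / len(paths)


def arrived_together(ss_values, spread=2):
    """True if all 'SS' values lie within `spread` seconds of each other."""
    return bool(ss_values) and max(ss_values) - min(ss_values) <= spread


# A share of 0.0 means pages were fetched without any of their images or scripts.
print(asset_share(["/a/", "/b/", "/c/?p=1"]))
# Widely differing SS values mean the requests did not arrive at the same time.
print(arrived_together([1, 40, 2]))
```

A zero asset share supports the “grabber” reading, while the SS check is what would support or undermine the claim of concurrent connections.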
Hello IronMan, thank you for your lesson on Apache. I am quite familiar with Apache, having been fooling with it from its initial 1.0 version to the current 2.2.
I understand fully what you are asking for, but you, like others, are really missing the point AGAIN. What TR needs to prove is that SPH was grabbing their website, PERIOD.
Whether the grab caused any increase in load of the server is irrelevant and not something which TR is interested in.
Considering that SPH is now MUM, being thrashed by readers and screwing itself, the admin have decided that it is no longer needed to show anything more than necessary unless SPH responds again and when that happens, TR will throw the full server logs at them.
Thanks.
@Sinkapore … what a lot of tosh … ‘the admin have decided that it is no longer needed to show anything more than necessary unless SPH responds’.
This is a complete cop-out and, in fact, implies the TR has something to hide. IronMan had a perfectly valid, and tech-based, request.
I have no understanding of any of the IT / Server stuff … but in this instance it just looks like TR is now trying to cover their arse.
Why not post the full information … that way heaps more people in the IT industry who really understand this stuff can clearly see what happened.
Who knows, they might show TR that their webhost company hasn’t been doing a good job, and therefore save them some money or something.