Why content “grabbing” is different from normal “browsing”

Some readers defended SPH on the grounds that its employee or bot might simply have been “browsing” our site for “research” purposes, and blamed us for hanging him or her out to dry.

However, content “grabbing” is not equivalent to innocuous “browsing”, which is why we raised the alarm on the advice of our system administrator.

Below are two snapshots of our server log that illustrate the difference between the two:

#1 Snapshot from the incident:

Initial Log

As the snapshot above shows, the same IP address, 203.116.231.234, which was traced back to SPH, was logged connecting to the server many times at VERY SHORT intervals, so its entries appear one immediately after another.

This is characteristic of web-grabber software (it could also be a sort of SYNC attack, but a SYNC attack would request the same content repeatedly rather than many different pages), and certainly not of any browser.

#2 Snapshot taken on 6 November 2009, 9pm during “normal” browsing:

Normal Log

A normal browser reading the site would also show up under the same IP address, but its requests would not be as repetitive or as closely spaced as those in snapshot #1.

The reader’s IP address would still be logged, but the interval between connections would be longer, and the entries would not appear one after another.

In the highlighted example, 202.156.13.246 and 202.156.12.228 are accessing the site ‘normally’.
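The difference between the two snapshots comes down to how closely spaced one IP address’s requests are. A minimal Python sketch of that check is shown here, assuming the log has already been reduced to (timestamp in seconds, IP address) pairs. The function names, the thresholds and the sample data are all hypothetical; the addresses are reserved documentation addresses, not entries from our log.

```python
from collections import defaultdict
from statistics import median


def median_gap_by_ip(requests):
    """Return {ip: median seconds between consecutive requests}.

    `requests` is an iterable of (timestamp_seconds, ip) pairs.
    IPs with fewer than two requests are skipped.
    """
    times = defaultdict(list)
    for ts, ip in requests:
        times[ip].append(ts)

    gaps = {}
    for ip, stamps in times.items():
        stamps.sort()
        if len(stamps) >= 2:
            gaps[ip] = median(b - a for a, b in zip(stamps, stamps[1:]))
    return gaps


def flag_rapid_clients(requests, max_gap=1.0, min_requests=5):
    """Return the IPs whose requests are both numerous and closely spaced.

    The thresholds are assumptions chosen for illustration only.
    """
    requests = list(requests)
    counts = defaultdict(int)
    for _, ip in requests:
        counts[ip] += 1
    gaps = median_gap_by_ip(requests)
    return sorted(
        ip for ip, gap in gaps.items()
        if gap <= max_gap and counts[ip] >= min_requests
    )


# Made-up data: one client requesting every 0.1 s, one reading at a human pace.
sample = [(0.1 * i, "192.0.2.10") for i in range(8)]
sample += [(0.0, "198.51.100.7"), (45.0, "198.51.100.7"), (130.0, "198.51.100.7")]

print(flag_rapid_clients(sample))
```

A check like this only says that the requests were machine-paced. It says nothing about who ran the software or why.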

Since Mr Pereira has admitted that there were SPH employees visiting TR during the period when the “grabbing” incident shown in snapshot #1 is alleged to have taken place, he should be able to answer the three most important questions we have been asking all along:

1. Who was the employee “grabbing” content from TR during the stated period?

2. Was he or she using web-grabber software to do so?

3. What were his or her motives for “grabbing” our site?

All we are asking for is an explanation of what really happened. We will close the case once we get the answers we are waiting for.


12 Responses to “Why content “grabbing” is different from normal “browsing””

  • deoxin:

    @admin

    am no IT wizard, still trying to digest the technicality. so ..just for clarity-sake. here, u said:
    “This is the characteristic of a web grabber kind of software (it may also be sort of a SYNC attack but a SYNC attack would be grabbing the same content instead of multiple), certainly not any browser’s characteristic.”

    ..and the other article, u said:
    “Without considering which idiot in the world would want to spoof ONLY ONE IP address to grab TR’s site and incriminate SPH, let’s get down to the technical aspects of IP spoofing.”

    with the above, r u implying the following?
    it’s possible that an idiot so stoopidly spoof only one IP address and launch a variant of SYNC attack (variant: grab multiple address, instead of just one).

  • deoxin, IP spoofing is not technically possible in this incident, the IP address grabbing the site is NOT spoofed.

    I have written some technical details explaining why it wasn’t a spoofed IP; they are available in another thread on the same matter.

  • deoxin:

    @Sinkapore

    okay! i think ..i know what u said there now.

    u r saying that .. the machine with SPH’s IP-address made multiple connections almost simultaneously to TR’s many-links.
    This is plausible only if it’s done by a “grabber program” that retrieve the many-links (to connect to) from TR’s website itself ..which implies that the machine DID received replies from TR’s website.

    If the machine’s IP was spoofed, the machine wouldn’t have gotten the replies in the first place (and hence, cannot determine the many-links to connect to, especially the links to the old articles).

    Correct? -)
    Stoopid SPH!

  • Yes deoxin, that’s EXACTLY what I meant to say. Thank you for rephrasing it in such a clear manner.

    One point needs clarification though: “concurrently” would be a better word than “simultaneously”, since Apache’s (the web server’s) keepalive feature would have allowed that. So a single IP address can technically make hundreds of connections to the web server, requesting information within a VERY VERY VERY short period of time, measured in nanoseconds.

    Picture yourself owning a store that can only allow 100 customers into the premises at a time. Once the first 100 customers have entered the store, other customers are ‘locked’ out and put into a queue, which slows down their access because of the wait.

    Hopefully readers will better understand.

  • rolleyes:

    Nope, there’s a way to build a map of the site without the site noticing. It’s possible with automatic scripts through many proxies over a few days.

    But yah, who in the right frame of mind want to impersonate a grabber? Grabber they grab, so what?

  • ObserverOne:

    My background is in IT security and I have been following this incident closely. One thing for SURE is that the IP address shown is indeed INVOLVED in CONTENT GRABBING. No other explanation can deny this fact.

    On the second issue, whether IP spoofing is involved or NOT, this can be technically traced. If SPH is willing to prove that its IP address was spoofed by others (meaning SPH is NOT guilty), it can do so by getting help from IT security experts. But if SPH can ONLY DENY without proper facts and proof, I think it is being NOT professional.

    The ball is at SPH court. So far, it is being handled TOTALLY in VERY AMATEUR manner.

  • @ObserverOne, there was NEVER any doubt in my mind that the IP addresses IS GUILTY of grabbing/rip TR’s site after seeing the server logs handed to me by TR’s system administrator although I am not in the IT Security line, just like to ‘play’ with servers. :)

    On the second issue, I agree that it can be traced, and quite easily at that. Assuming that the IP address was indeed spoofed, TR’s servers on the date and time mentioned would have sent hundreds of responses, SYN cookies and ACKs to the REAL IP address. That would, ironically, also create a flurry of connects to SPH’s servers. :(

    Some readers just don’t understand that although it is perfectly legal for a person or entity to grab a site, but the fact that it was SPH makes a difference. One would want to question and know WHY would SPH, a corporate media GIANT do that? It can’t be for offline browsing, can it?

  • dogbert:

    Sinkapore -> you deride fellow readers as not understanding the case, where the real matter is one of differing values. Note that you’ve agreed on the legality of a person/entity in “grab”-ing a site (though i’d point out that the usage of grab is an emotive word, technically fair & consistent, but otherwise aimed at swaying an audience used to standard english). Many would not see any problem with SPH or CIA or Al Queda choosing to collect TR webpages. Given TR’s earlier DDOS complaint, it would seem fair to believe that the reliability of TR’s webhosting was in question, and that offline archival is more reliable than TR’s webhost services. The underlying value difference is in whether we the reader, impute any ill-motives to SPH or any other online entity. To TR supporters, all actions from other are fraught with suspicions of ill motives. As a party without interest in TR or SPH, this dispute is merely hot air.

    This incident shows me more about TR and its internal processes than anything about SPH. TR basically had a knee-jerk reaction, initially imputing a continuation of DDOS, followed by tabloid-style headlines, followed by a backdown to a minor, toned-down, challenge to SPH to do internal investigations on specious no-damage grounds, followed by an attempt to justify the moral high ground of its series of actions. I like the transparency of the series of events, but it doesn’t give me much confidence in TR management that has appeared to hold only its internal views, which appears to be overly critical and cynical about all else. And hasn’t seemed to be that open to robust competing views to its own. Greater diversity and robustness, with a cohesive and sensitive viewpoint would generate TR more value.

  • dogbert, you keep playing like an old record the aim of which is to ridicule TR’s admin by pretending to reply to my postings.

    Well nice try! Ain’t going to work cause I find it tiring to reply to a broken down record, repeating itself over and over and over and over and over again.

    So I will stop responding to you UNLESS you have a fresh set of argument to put forth.

  • IronMan:

    Hi Sinkapore,

    I read with interest on this article because I am a system admin myself. I agree that the screenshot likely shows a machine grabbing the web pages from TR. This is due to the absence of requests that often makes up the rest of HTML pages, like images and javascript files.

    However I cannot conclude without a doubt that there are concurrent connections to the TR website. This is because some important details are missing from the cropped screenshot, details needed to pinpoint that SPH is indeed opening multiple concurrent connections spaced within nanoseconds to the website. Would TR be willing to post the full screenshot in #1 instead of the cropped screenshot?

    The reason is this. What your cropped screenshot shows is only the last request fulfilled by each Apache process. Apache typically operates on a pre-fork model, meaning there are a number of spare Apache child processes ready to handle any incoming request. If the number of incoming requests is low, any of the spare Apache child processes can service the request. Hence if an Apache process is idle and waiting for another request, it will still show the last request in the Apache status (it will not be blanked out).

    The only way to see if all your Apache processes are currently busy responding to requests is to look at some additional details at the top of your cropped screenshot. These show the state of each Apache child process, i.e. whether it is idle or servicing a request. In addition, to the left of your cropped screenshot, there is a column of data called ‘SS’, which is the number of seconds since the last request. The numbers in this column need to be all very close to each other (within a range of 1-2) for it to imply that all these requests came in at around the same time. If not, you may have just jumped to the wrong conclusion.

    Regards.
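The two checks IronMan describes (whether the requests include the images and JavaScript files that make up a page, and whether the ‘SS’ values all fall within 1-2 seconds of each other) can be sketched in Python as below. This is only an illustration: the function names, the list of asset extensions and the sample values are assumptions, not data taken from TR’s server status page.

```python
import os
from urllib.parse import urlsplit

# Assumed set of extensions that a browser would normally fetch with a page.
ASSET_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico"}


def asset_fraction(paths):
    """Fraction of requested paths that look like page assets.

    A browser rendering pages usually requests images, CSS and JavaScript
    alongside the HTML; a fraction near zero hints at a page grabber.
    """
    if not paths:
        return 0.0
    assets = sum(
        1 for p in paths
        if os.path.splitext(urlsplit(p).path)[1].lower() in ASSET_EXTENSIONS
    )
    return assets / len(paths)


def ss_values_cluster(ss_seconds, max_spread=2):
    """True if all 'SS' values lie within `max_spread` seconds of each other.

    Only a tight cluster supports the claim that the requests
    arrived at around the same time.
    """
    return len(ss_seconds) >= 2 and max(ss_seconds) - min(ss_seconds) <= max_spread


# Hypothetical inputs for illustration.
print(asset_fraction(["/a/", "/b/", "/style.css?ver=2", "/logo.png"]))
print(ss_values_cluster([3, 4, 4, 5]))
print(ss_values_cluster([1, 40, 300]))
```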

  • Hello IronMan, thank you for your lesson on Apache. I am quite familiar with Apache, having been fooling with it from its initial 1.0 version to the current 2.2.

    I understand fully what you are asking for, but you, like others, are really missing the point AGAIN. What TR needs to prove is that SPH was grabbing their website, PERIOD.

    Whether the grab caused any increase in server load is irrelevant and not something TR is interested in.

    Considering that SPH is now MUM, being thrashed by readers and screwing itself, the admin have decided that there is no need to show anything more than necessary unless SPH responds again, and when that happens, TR will throw the full server logs at them.

    Thanks.

  • amused:

    @Sinkapore … what a lot of tosh … ‘the admin have decided that it is no longer needed to show anything more than necessary unless SPH responds’.

    This is a complete cop-out and, in fact, implies the TR has something to hide. IronMan had a perfectly valid, and tech-based, request.

    I have no understanding of any of the IT / Server stuff … but in this instance it just looks like TR is now trying to cover their arse.

    Why not post the full information … that way heaps more people in the IT industry who really understand this stuff can clearly see what happened.

    Who knows, they might show TR that their webhost company hasn’t been doing a good job, and therefore save them some money or something.
