
PHP Server Side Scripting Forum

    
Most efficient way to search a large text file.
KenB
msg:4076696 - 12:37 am on Feb 9, 2010 (gmt 0)

I use an IP database in CSV format to help geo-target ads to users, among other things. To reduce the CSV file's total size I have stripped it down to its bare bones. Here is some sample data:


3642228736,3642232831,DE
3642232832,3642236927,RS
3642236928,3642241023,CH
3642241024,3642245119,DE
3642245120,3642249215,LV
3642249216,3642253311,FR
3642253312,3642257407,FI
3642257408,3642261503,RU
3642261504,3642265599,GB
3642265600,3642265855,IR


Since my primary goal is to turn off certain ads when I'm positive a user isn't in a country supported by a given ad campaign, I have also stripped out all US IP addresses to further reduce the file size, and I set the country code to "US" if my search doesn't return any results. It isn't 100% accurate, but it is good enough for my purposes.

Since the file still has just under 100,000 records in it, I want to make sure my PHP search function is as efficient as it can possibly be, so I'd like feedback, suggestions, and comments on my function below. It is a derivative of a function I found online, which I significantly altered to streamline it.


function ipcountrycode($ip){
    $ip = sprintf("%u", ip2long($ip));

    $csvfilename = "include/ipCC.csv";
    $fp = fopen($csvfilename, "r");

    // $low and $high are byte offsets into the file, not record numbers.
    $low = 0;
    fseek($fp, 0, SEEK_END);
    $high = ftell($fp);
    $CC = 'US';

    while($low <= $high){

        // Move to the mid point (in bytes) between low and high.
        $mid = floor(($low + $high) / 2);
        fseek($fp, $mid);

        // Moves to the end of the partial line we landed in.
        $line = fgets($fp); // Is this actually doing anything of value?

        // Turn the next full line into an array.
        $ipdata = fgetcsv($fp, 26);

        // Landed past the last complete record: seek lower.
        if($ipdata === false){$high = $mid - 1;}

        // If IP is bigger than column 2 then seek higher.
        elseif($ip > $ipdata[1]){$low = $mid + 1;}

        // If IP is smaller than column 1 then seek lower.
        elseif($ip < $ipdata[0]){$high = $mid - 1;}

        // We've found a match.
        else{
            $CC = $ipdata[2];
            break;
        }

    }
    fclose($fp);
    unset($line);
    return $CC;
}


Any feedback would be greatly appreciated.

 

rocknbil
msg:4076755 - 3:58 am on Feb 9, 2010 (gmt 0)

Gotta ask man, "most efficient way" . . . why can't you put it in a DB?

KenB
msg:4076763 - 4:32 am on Feb 9, 2010 (gmt 0)

I need to conserve database connections. My site takes a lot of hits, so if I don't need the database, I don't call it. Basically I'm only allowed so many simultaneous connections. If my site were calling the database for every request, then on a busy "real" traffic day a bad bot or bad user trying to scrape everything at once would leave me with a whole string of failed connections, which isn't good for real users.

Really if it weren't for bad bots and/or bad users trying to scrape my entire site (around 12,000 pages total) all at once every now and then I'd never have any database issues at all.

If I went to a dedicated database server, so that my database could withstand a scraping barrage, it would cost me an additional $300 a month. I used to call the database for every request, but as bad bots became more of an issue I redesigned things so I could eliminate calling my database for every request.

I've been building my site for over 14 years so it is a mix and match of content that is stored in static files and stuff that is pulled from a database. So not every request truly needs a database call.

I'm also not convinced that there would be any performance improvement if I were only opening a database connection to run one query to return a country code. The web server and database server I use are on different machines (in the same NOC) so there is a certain amount of overhead just making a connection.

One thing that reduces the performance hit of reading a large file from the hard drive is that my site is hosted on a server running FreeBSD, which has a built-in method for caching files in memory. Basically the CSV file in question gets called enough that it stays in memory much of the time.

I've been running A vs. B testing on my site with and without the call to the country code function, and I can't see a consistently measurable difference in response time with or without the call to the giant file. It's just that I'm running out of big speed hits to optimize, and I'm now going for 10-40 ms here and there.

If I'm reading this function correctly, it should take a maximum of about 16 iterations to search through 100,000 records. I just want to make sure it is doing what I think it is doing and that it is as efficient as I can make it.

jkovar
msg:4076800 - 8:00 am on Feb 9, 2010 (gmt 0)

It's a good thing that the data you're looking for, being the 2 character code, is always the same size.
That means simple math can be used to get to the position of the data quickly.

Now if you were to store the 2 character code for every single IP address on a new line,
you would end up with a file roughly 12 GB in size.
However, since every record is exactly two characters long, there's no need for a newline/delimiter.
That brings the filesize down to roughly 8 GB.

Now completely ignoring that some systems don't like working with filesizes that large,
assume that every single IP address had the 2 character code stored in the file.

When an IP needed to be looked up, it would be as simple as using the bcmath PHP extension to double the integer corresponding to the IP address (since each record is two characters), and then a single seek could zip directly to the 2 character code.

Now we have to face the problem of systems that don't like files that large.
What we can do is split the file into chunks, then it's only a matter of calculating offsets for however many files the records are split into.
It's still a little bit of simple math and a single seek though.
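
For what it's worth, the lookup side of that scheme might look roughly like the sketch below. This is untested, and the file name and layout are made up; as noted, on a 32-bit PHP build you'd want bcmath for the offset math.


$ip = sprintf("%u", ip2long($_SERVER['REMOTE_ADDR']));
$fp = fopen("ipCC.bin", "rb"); // hypothetical pre-built lookup file

// Each IP's 2-character code sits at byte offset 2 * (IP as integer),
// so simple math replaces searching entirely.
fseek($fp, 2 * (int)$ip);
$cc = fread($fp, 2);
fclose($fp);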

KenB
msg:4076959 - 2:40 pm on Feb 9, 2010 (gmt 0)

Now if you were to store the 2 character code for every single IP address on a new line,
you would end up with a file roughly 12 GB in size.


The total file size for the CSV file is 2,516 KB. Before I stripped out US IPs it was still less than 7 MB. If you look at the sample data you will see there are three columns. The first two columns are IP addresses stored as integer values. The first column is the low IP address and the second column is the high IP address for a given range. This means each row could represent thousands if not tens of thousands of IP addresses. In the case of my sample data above, the first line covers 4,096 IP addresses.

What the code above is doing is converting the user's IP address into an integer value via the built in PHP function ip2long(). It then sets the high and low row numbers for the file. Each pass of the while loop goes to the mid point between the current low and high value. This row is then read and broken into an array based on the commas. If the user's IP address is bigger than the IP address in column two the low point is set to the current mid value plus one and the loop repeats. If the user's IP address is lower than column one then the high value is set to the current midpoint minus one and the loop repeats. Otherwise we know we found the record we are looking for and the country code is set and the loop terminates.

At least that is how I believe the function works, but I want to be sure it is doing what I think it is, and that I'm using the most efficient methods possible to complete this search.

CyBerAliEn
msg:4077015 - 4:15 pm on Feb 9, 2010 (gmt 0)

Your code is fairly simple, and given that it uses the "algorithm" you describe above (in words), it should be fairly efficient. The fact that you say it can traverse this large file and find the desired code in ~16 iterations indicates it is quite efficient.

The fact that you've tested your site A/B (with and without the lookup) and see no big difference also indicates that it is efficient.

So all in all, I would be happy and unconcerned with this.


However, as rocknbil points out... working with a database seems like the way to go. But given what you say, processing this text file seems appropriate for the circumstances.

KenB
msg:4077046 - 4:59 pm on Feb 9, 2010 (gmt 0)

I'd love to pull more from my database, but I am trying like heck to avoid the need to go to a dedicated database server. If it weren't for individuals trying to scrape large sections of my site on a regular basis, I'd never have database issues. I'd bet I lose 100ms of processing time trying to weed out bad bots to protect core processes.

rocknbil
msg:4077218 - 9:24 pm on Feb 9, 2010 (gmt 0)

'K, well, there's a considerable performance advantage to DBs over plain text files, and opening/closing plain text files is much harder on a disk, but this is what you have, so there we have it.

I don't have code available, but I would probably try several approaches, or a combination of them:

- Rather than slurp up the entire file with one of PHP's pre-coded functions, which is going to hog memory like no tomorrow on such a large file (and, as a corollary, at the worst possible moment), I'd probably read it line by line, exploding as I go, until the match is found.

Aside from what you have there, you can part out these files by some scheme into smaller files; this will save you some time. For example, a dictionary could keep the A's in a.csv, etc. Making the files smaller may render huge-file parsing irrelevant.
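
A rough sketch of the line-by-line idea would be something like the following (untested, just to illustrate; it reuses the file name from your function above):


$ip = sprintf("%u", ip2long($_SERVER['REMOTE_ADDR']));
$CC = 'US';

// Stream the CSV with fgets()/explode() so only one line sits in memory at a
// time. Note this is a linear scan: it trades the binary search's handful of
// seeks for minimal memory use.
$fp = fopen("include/ipCC.csv", "r");
while(($line = fgets($fp)) !== false){
    list($start, $end, $cc) = explode(",", trim($line));
    if($ip >= $start && $ip <= $end){
        $CC = $cc;
        break;
    }
}
fclose($fp);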

TheMadScientist
msg:4077222 - 9:35 pm on Feb 9, 2010 (gmt 0)

I'd bet I lose 100ms of processing time trying to weed out bad bots to protect core processes.


1.) Where are you weeding them out, in PHP or in the .htaccess?

2.) Are you running one of the bad-bot scripts posted here that only allow a certain number of requests per time period and/or a honey-pot ban script to keep them from coming back?

3.) If I were to break up the file, I would probably use a comparison to select the correct file, something to the effect of the following example.


$ip=sprintf("%u",ip2long($ip));
if($ip>File1Start && $ip<File1End) {
    $csvfilename="include/ipCC1.csv";
}
elseif($ip>File2Start && $ip<File2End) {
    $csvfilename="include/ipCC2.csv";
}


Just some thoughts I'm having...

TheMadScientist
msg:4077233 - 9:55 pm on Feb 9, 2010 (gmt 0)

You might actually be able to do some more filtering in the file selection and cut down on the comparisons:


$ip=sprintf("%u",ip2long($ip));
if($ip<File_Set_1_End) {
    if($ip<File_1_End) {
        $csvfilename="include/ipCC1.csv";
    }
    elseif($ip<File_2_End) {
        $csvfilename="include/ipCC2.csv";
    }
    elseif($ip<File_3_End) {
        $csvfilename="include/ipCC3.csv";
    }
    elseif($ip<File_4_End) {
        $csvfilename="include/ipCC4.csv";
    }
    elseif($ip<File_5_End) {
        $csvfilename="include/ipCC5.csv";
    }
}
elseif($ip<File_Set_2_End) {
    if($ip<File_6_End) {
        $csvfilename="include/ipCC6.csv";
    }
    elseif($ip<File_7_End) {
        $csvfilename="include/ipCC7.csv";
    }
    elseif($ip<File_8_End) {
        $csvfilename="include/ipCC8.csv";
    }
    elseif($ip<File_9_End) {
        $csvfilename="include/ipCC9.csv";
    }
    elseif($ip<File_10_End) {
        $csvfilename="include/ipCC10.csv";
    }
}


Anyway, just another thought while I'm sitting here thinking about how I would try to get to the right place in a smaller file faster and I'm sure you can see what I'm getting at...

KenB
msg:4077237 - 10:04 pm on Feb 9, 2010 (gmt 0)

I have a truncated UA block list in my .htaccess file; see: [webmasterworld.com...]

I also use a series of IP range block lists in PHP. One list is a straight Class B or Class C block list, which should be fast and efficient enough, as it is a simple strpos() check with the blocked IP classes in the haystack. The second IP block list in PHP matches on subsections of an IP range. For example:


if(
($aIPClass['B']=="173.212." && $aIPParts[2]>=192) || //#*$!#*$! Network Operations Center Inc 2009-12-19
($aIPClass['B']=="173.45." && $aIPParts[2]>=64 && $aIPParts[3]<=127) || //Columbus #*$!x.#*$! Inc
($aIPClass['B']=="174.34." && $aIPParts[2]>=128 && $aIPParts[2]<=191) || //#*$!#*$!xservers.com
($aIPClass['B']=="194.110." && $aIPParts[2]>=160 && $aIPParts[2]<=163) || // #*$!#*$!x Host 2009-12-19
($aIPClass['B']=="194.110." && $aIPParts[2]>=160 && $aIPParts[2]<=163) || // #*$!#*$!xx Host 2009-12-19
($aIPClass['B']=="204.236." && $aIPParts[2]>=128) || // Amazon Web Services (dynamic hosting) 2009-12-22
...55 total lines

I've obscured some of the party names. I'm sure there is a more efficient way to do this, but I also want it to be easy for me to read and update.

The final IP block list is another strpos() match where the haystack contains the exact IP address that was blocked.
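
To illustrate the pattern (with made-up addresses, not my actual list), the exact-IP check is just:


// Space-delimited haystack of banned IPs; wrapping the needle in spaces
// keeps 192.0.2.1 from matching 192.0.2.14.
$blockedIPs = " 198.51.100.23 203.0.113.77 192.0.2.14 ";
if(strpos($blockedIPs, " ".$_SERVER['REMOTE_ADDR']." ") !== false){
    header("HTTP/1.1 403 Forbidden");
    exit;
}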

I also do a few other minor blocking methods like checking for certain types of behaviors via PHP.

I think I'm losing 30-40ms on my .htaccess list (I can't really benchmark exact figures) and a similar amount from the PHP side. I used to use more robust countermeasures, but I've pared back what I do to speed up overall page response times. Right now Google WMT is giving me an average googlebot response time of 394ms for the past 90 days. My hope is to get my average response times down below 350ms. I'd love to get my average below 300ms but I'm not sure this is realistic.

KenB
msg:4077254 - 10:23 pm on Feb 9, 2010 (gmt 0)

Anyway, just another thought while I'm sitting here thinking about how I would try to get to the right place in a smaller file faster and I'm sure you can see what I'm getting at...

I appreciate the thoughts; it is helpful to toss ideas back and forth. The one question is whether it is faster to call one monolithic file, especially if the OS gets in the habit of keeping it in memory because it is called so often, or to call one of a series of smaller files. A bunch of smaller files also becomes harder to maintain when it comes time to update my IP CSV file.

I can't remember the name of the search pattern I'm using (it's a binary search), but I do remember it is one of the faster ways to search big sorted data sets. Basically, each doubling in size of the dataset only increases the maximum number of iterations needed to complete a search by one.

  • 2 records: 2 max iterations
  • 4 records: 3 max iterations
  • 8 records: 4 max iterations
  • 16 records: 5 max iterations
  • 32 records: 6 max iterations
  • 64 records: 7 max iterations
  • 128 records: 8 max iterations
  • 256 records: 9 max iterations
  • 512 records: 10 max iterations
  • 1024 records: 11 max iterations
  • 2048 records: 12 max iterations
  • 4096 records: 13 max iterations
  • 8192 records: 14 max iterations
  • 16384 records: 15 max iterations
  • 32768 records: 16 max iterations
  • 65536 records: 17 max iterations
  • 131072 records: 18 max iterations
  • 262144 records: 19 max iterations
  • 524288 records: 20 max iterations
  • 1048576 records: 21 max iterations

As you can see you could search over a million records with a maximum of only 21 cycles of the while statement. The reason is that each pass of the while statement cuts the remaining records to sort through in half. So on the first pass of 1 million records you can eliminate 500,000 records from the search.

The question really boils down to what is the fastest way to read the file into memory and then jump forward and back in the file as you progress through the loop.
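
For example, the main alternative I can see is to slurp the whole file into an array once and binary-search on record numbers instead of byte offsets, something roughly like this untested sketch (same file and CSV layout as my function above). Whether that beats fseek() on the open file probably comes down to how well the OS caches things.


function ipcountrycode_inmemory($ip){
    $ip = sprintf("%u", ip2long($ip));

    // Read every record into an array; costs memory up front but makes the
    // binary search a matter of plain array indexing.
    $lines = file("include/ipCC.csv", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    $low = 0;
    $high = count($lines) - 1;
    while($low <= $high){
        $mid = (int)(($low + $high) / 2);
        list($start, $end, $cc) = explode(",", $lines[$mid]);
        if($ip > $end){$low = $mid + 1;}
        elseif($ip < $start){$high = $mid - 1;}
        else{return $cc;}
    }
    return 'US'; // same fallback as the original function
}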

jkovar
msg:4077286 - 11:02 pm on Feb 9, 2010 (gmt 0)

Sorry, I thought you wanted the most efficient way to solve the problem. I didn't realize there were hardware limitations. :)

brotherhood of LAN
msg:4077298 - 11:19 pm on Feb 9, 2010 (gmt 0)

Binary format may speed up the searching because it would take less space; 8 bytes for 2 IPs + 2 char country codes.

Also, how about having a fixed-length index? That would not involve 'searching'. You could have 255 values that indicate where in the file each unique class A begins, keyed on the lower-bound value:

1.*.*.* = 1st record
2.*.*.* = 50th record
3.*.*.* = 324th record etc

The IP would be 4 bytes, and since you have 100000 records, the offset would have to be at least 3 bytes.

Since it'd be fixed length, you can just multiply the class A by 7 bytes (4-byte IP + 3-byte offset value).

You may also want to have a 256th value to indicate the length of the file. Searching for 2.*.*.*, the index can tell you to start searching between the 50th and 323rd record.
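
The lookup side of the index is then pure arithmetic; roughly something like this untested sketch, where the file name and exact layout are up to you and $userIP stands for the visitor's dotted-quad address:


// Each class A occupies 7 bytes in the index: a 4-byte lower-bound IP plus a
// 3-byte record number, so the slot for any IP is found by multiplication.
$a = (int) strtok($userIP, ".");   // first octet of the user's IP
$idx = fopen("include/ipCC.idx", "rb");
fseek($idx, $a * 7);
$raw = fread($idx, 14);            // this class A's entry and the next one
fclose($idx);

$lo = unpack("Nip/C3off", substr($raw, 0, 7));
$hi = unpack("Nip/C3off", substr($raw, 7, 7));
$firstRecord = ($lo['off1'] << 16) | ($lo['off2'] << 8) | $lo['off3'];
$lastRecord  = (($hi['off1'] << 16) | ($hi['off2'] << 8) | $hi['off3']) - 1;
// ...then binary-search the CSV only between $firstRecord and $lastRecord.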

KenB
msg:4077299 - 11:21 pm on Feb 9, 2010 (gmt 0)

Not so much hardware limitations as ISP limitations. They put a limit on simultaneous database connections on shared servers. I've lobbied them in the past to increase the limit with no success. :(

Anyhow, the thread did give me some food for thought, and here is one change I made: I broke the monolithic file up into 4 equal parts and then added the following if statement to decide which file gets opened:


if($ip<3259317248){
    if($ip<1439557888){$csvfilename=$strRoot."include/WritableFiles/ip2country-1.csv";}
    else{$csvfilename=$strRoot."include/WritableFiles/ip2country-2.csv";}
}
else{
    if($ip<3399547952){$csvfilename=$strRoot."include/WritableFiles/ip2country-3.csv";}
    else{$csvfilename=$strRoot."include/WritableFiles/ip2country-4.csv";}
}
$fp=fopen($csvfilename,"r");

The result is that the script always does exactly two if checks to decide which of the four files it will open. The integer values are the first low value in the second half of each split. So instead of opening a 2.7MB file it is now opening a 645,000kb file. This should be much nicer to the server.

TheMadScientist
msg:4077300 - 11:26 pm on Feb 9, 2010 (gmt 0)

especially if the OS will get in the habit of keeping it in memory if it is called enough


If you're getting it into RAM, I would leave it as is, or possibly pull the least-requested sections out into another file if there are enough of them to make it worth the time... The biggest slowdown I know of personally is going to the disk, so if you can get the whole thing into RAM you should be running at optimum speed, AFAIK. I often run a single database table/index when possible because of the 'hot caching', and if I could do it with my server I would there too. In DBs, the more accessed a file is, the closer to the front of memory it moves, so if your server is doing the same with files, then IMO keep it the way it is and, at most, move the 'rarely used' locations to another file, which could be 'super light' and save an iteration or two on the bigger file.

IMO iterations are not nearly as important or time-consuming as getting into RAM, and if the file is already there, you could actually slow things down for some users by separating it into smaller files. The exception is if you can separate the least-used locations from the most-used locations; then you could maybe keep a smaller file (or files) in RAM and speed things up a bit, but you would probably need to 'hit the disk' for the less-used locations.

You might gain some 'most user' speed if you went with one main 'big file' in RAM and smaller ones for the less-used locations, using a conditional to select the correct file to access, but then you run into the maintenance issue you were talking about.

[edited by: TheMadScientist at 11:35 pm (utc) on Feb 9, 2010]

KenB
msg:4077301 - 11:27 pm on Feb 9, 2010 (gmt 0)

Binary format may speed up the searching because it would take less space; 8 bytes for 2 IPs + 2 char country codes.

Remember, I have to be able to update this CSV file from time to time, and it downloads with the IPs both in the normal 255.255.255.255 format and as integer values. I then strip off the standard IP address format to reduce file size. It's not critical that this be 100% accurate, but it still needs to be updated at least once a year. If I were to convert the IP addresses to binary format I'd need some automated conversion for that.

Readie
msg:4077341 - 12:57 am on Feb 10, 2010 (gmt 0)

If I were to convert the IP addresses to binary format I'd need some automated conversion for that

PHP has you covered :)

decimal => binary
[uk.php.net...]

binary => decimal
[uk.php.net...]

brotherhood of LAN
msg:4077361 - 1:44 am on Feb 10, 2010 (gmt 0)

For binary you'd more likely want to use the pack & unpack functions; decbin() and bindec() only work with "0"/"1" string representations of binary, not raw bytes.
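
For example, one CSV record packs down to a fixed 10 bytes; a quick sketch using the first sample row from the top of the thread (on a 32-bit PHP build the unpacked integers may come back signed):


// Two 32-bit unsigned big-endian IPs plus the 2-character country code.
list($start, $end, $cc) = explode(",", "3642228736,3642232831,DE");
$record = pack("NNa2", $start, $end, $cc);        // 10 bytes of binary
$fields = unpack("Nstart/Nend/a2cc", $record);    // back to an array
// $fields: array('start' => 3642228736, 'end' => 3642232831, 'cc' => 'DE')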

KenB
msg:4077434 - 5:23 am on Feb 10, 2010 (gmt 0)

I don't think converting to binary will save me anything. However, converting the integers to hexadecimal would trim a couple of characters off each one, around four characters per row, which would cut the total file size by roughly 390 kb.
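
For example, converting the first value from my sample data:


echo dechex(3642228736); // "d9180000" (8 characters instead of 10)
echo hexdec("d9180000"); // back to 3642228736 when searching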

NOTE: in a post above I mention that each of the four files I split my CSV file into were 645,000kb. I should have said they were 645kb. I kind of misread things earlier.

KenB
msg:4078606 - 8:17 pm on Feb 11, 2010 (gmt 0)

Thanks to the advice in the thread at [webmasterworld.com...] I learned about something called Xdebug and have been using it to profile my scripts, including the geolocation function that started this whole thread. I'm happy to report that on my laptop, which I'm sure has a slower hard drive than the server I'm using, the function above, with the CSV file broken into the four files described, takes only 6.6ms to process. I guess I can't ask for better than that.

Thanks for everyone's suggestions.
