Forum Moderators: open
The data set is very large (2.3 million world place names) so the tab-delimited source text file is 140MB. Very slow to open on my laptop! This 'two week project' has stretched out to a month and a half -- mostly because I wanted to prepopulate the database with this locations info.
He said they needed to break the data into smaller chunks first. Then a week ago he said that it would be all done by Saturday. Now it's Monday and he says that they have only imported 40MB so far, have 100MB to go. Fine, there is not a huge rush on this project but I am wondering if these guys are totally incompetent.
This is all just on their test server. Am I going to have to wait weeks for them to set it up on my server when it's ready? Comments welcome, thanks.
Something with around 2 million+ records / 140Mb could be done in minutes quite easily.
The problem could be if
a) the hardware they are using is not upto scratch (not enough ram, too much disk activity)
b) the data is badly presented and needs to be massaged first (different date formats half way through the list)
I get the impression they are not bothered about your request and are treating it as low priority or have lost the data.
If you still have the data do it yourself. Post up the specs of DB, O/S, hardware, and you'll get better assistance here.
We are using Windows 2000 server with IIS and around 1 GB of Ram for server.
Sir people whom you have consulted might be using some other language C++ etc which is faster and handles much bigger files.
Php has many limitation with the size of file.
Once files are splitted (which we are doing) in smaller size data import won't take much time.
its basically splitting files into smaller sized files taking time.
sir we are not delaying this intentionally.
Does this make sense?
I'm running Apache, will this be a problem considering they are doing this on Windows 2000 server?
It sounds to me like you're paying someone to move a pile of dirt from point a to point b... And when you ask why it is taking so long they respond with something like:
We are using a wheelbarrow with 5 cubic feet capacity. Sir people whom you have consulted might be using a shovel to load/unload the dirt into/from the wheelbarrow, respectively. Our tablespoon that we are using to load/unload the dirt into/from the wheelbarrow has limitation with the amount of dirt we can move in one scoop. Once we get the wheelbarrow loaded it won't take much time. It's basically using a spoon instead of a shovel that is taking time. Sir we are not delaying this intentionally.
ROFL... If what they are saying is true about PhP, it sounds to me like, as they practically admitted, they simply picked the wrong tool to load your data for which you are likely getting charged by the hour... If what they are saying about the limitations of PhP are NOT true, then it sounds like they know not what they speak of.
[edited by: ZydoSEO at 5:55 am (utc) on Feb. 14, 2008]
It just requires a simple
load data local infile "yourdata.txt" into table ....
mysql would chew over that quicker than they could type their response email to you.
[dev.mysql.com...]
Assuming the data is in a well-know text format (comma-separated values, tab-separated values, etc.) it should be possible to handle it with your database manager's "import" facility, without having to write any code.
I load an MySQL database with 800,000 records (a dump from SQLite) in one table and 600,000 in another in < 5 minutes. The tables have maybe 20 fields each. The database occupies maybe 80MB.
Once files are splitted (which we are doing) in smaller size data import won't take much time.
its basically splitting files into smaller sized files taking time.
This makes absolutely no sense at all. There's no sense in splitting the source file(s). Loading the database is a sequential process. You read one line of data, you stick it into the database.
Even if they DID have to split the file into smaller ones, again, that's a few minutes work with the right tools. On a linux system it could be done with command-line tools in minutes.
Sir, the technical term for what you are experiencing is "a snow job".
BTW, if you are running on Windows, you might want to get the utility "V". It's a very fast file viewer that will view any kind of file (including binary). It's ideal for viewing large text dump files like this, if you need to get in and explore the data. Search for "V file viewer".
Here is the latest response:
1.code will work in apache.
2.http://dev.mysql.com/doc/refman/5.0/en/load-data.html will not work in our case as its not just about importing data as it is.
3. we did search for softwares that could automate splitting files but could not find any which could split files by lines all we found could split files in sizes which surely won't work in this case.
4. feeding data in database was never part of this project we are doing this only to compensate the extra time this project took.
5. About "Snow Job" , sir we don't believe in giving such excuses. if we are not working on your project then we will surely say that we are not working on it and not give excuse that data import is taking time. Moreover how can you say that we are lying ? Are you here everyday with us checking everything that we are doing ?.
6. About the tools for managing text files on linux we do not have idea as we do not use linux OS. so can not comment on that.we are towards completion.
We assure you that it will be complete this week itself.
Thanks for your cooperation.
OK, well I guess they just aren't that experienced (like me). I'll just have to wait and see. They've had a hard time with this from the get go, I've had to explain a few things about the data that should have been obvious.
Any pointers on the file splitting software?
[edited by: Tourz at 6:44 pm (utc) on Feb. 15, 2008]
You don't seem to know much if anything about the design yourself.
It sounds as if you've defined your widget in broad strokes, and expect them to come up with a widget.
If you are not a software developer yourself, you probably should hire one (separate from your offshore developers) to oversee the design. You need somebody with a higher-level viewpoint than a coder has.
It's really hard to tell if they are being unreasonable or not, since you've given us so little information.
Let's back up a bit. You've got a lit of 2.3 million world place names. What is it that you want to do with it?
We've all been assuming that you simply need the place names loaded into a table. From the response, there's more than that going on.
As far as splitting files, it is trivial. Using the Linux/Unix split command:
split -l 1000 myfile.txt
will split myfile.txt into files of 1000 lines called myfileaa, myfileab, myfileac, etc.
You can get implementations of split that will run on Windows, or there are numerous packages of Linux-like command-line utilities available for Windows. The most popular is Cygwin. While Cygwin is overkill for this, it's a handy tool to have if you are stuck in a Windows world - you get pretty much a complete set of Linux command-line tools, which are pretty handy if you are working with text files a lot.
And in any case, simply writing a command-line program to split a file into pieces is a one-hour job, tops, with bells and whistles.
But there's no need to re-invent this simple tool!
I still don't see the point of splitting the file. The names have to be fed to the database one at a time. No difference if you read 1000 lines each from 1000 files or 1,000,000 lines from one file. It will take the same amount of time.
Not being able to locate a utility to split a text file into chunks of size "n" shows an appalling lack of knowledge. Not about programming, mind you, but simple use of the Internet!
Given a search engine, anybody's 12 year old kid should be able to accomplish this in a few minutes.
Are you here everyday with us checking everything that we are doing ?.
And that, I'm afraid is the biggest problem with offshoring. You really need to start with a small project and assess capabilities and responsiveness before you dive into a big project. You need to establish a LOT of trust, and you can't really do so using some coders-for-hire website's rating system, as they are too easy to game. You need to do it individually, and incrementally.
Actually the project is well defined: we are cloning an apartment rental directory.
I made the project a bit more complicated because I wanted to have the list of possible locations prepopulated with place names so I did not have to input every city etc. I want people to be able to select their location from a series of drop down menus.
The breadcrumb navigation trail on each page will reflect the locations heirarchy (ie continent, subregion, country, state, city)
I consider this to be a small test project. They are doing it for $400, plus a bonus for this extra work.
Your tip about hiring an expert here to oversee the oversea project is a good one, thanks. What would you charge for something like this?