I want to crawl through all the files in many folders, classify the data in some manner, and then put it into a database.
I have renamed the folders and files with numbers so that I can use range() and foreach().
The main problems are:
The files are not in a standard format.
The 6th line is fixed: it always contains the document number.
But the 7th line can be blank, or it can be the name of a person or company.
If the 7th line is a name, the 8th line will be the name of the business, or it can be the status of the company, or even the company's address!
I am totally confused about how to do this. Also, there is not necessarily only one name in the document, on the 7th line. There can also be a name on the 8th line, the 9th line or even the 11th line.
Any ideas for overcoming all these 'problems'?
Welcome to WebmasterWorld with your puzzling problem.
Sounds like you're going to have to write some fuzzy logic that gets you 99% of the way when processing the files, and then visually review the results for final corrections.
Some assumptions can be made, such as that names don't typically start with numbers, so a line that starts with a number is probably an address and anything else is probably a name.
Obviously it will fail with company names like "21st Widget" or something like that.
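To make that first assumption concrete, here is a minimal sketch in PHP. The function name guessByLeadingDigit() is made up for illustration, and the only rule it applies is the leading-digit one, so it misfires on names like "21st Widget" exactly as described above.

```php
<?php
// Hypothetical helper: guess what a line is purely from its first character.
// Assumption: a leading digit means "address", anything else means "name".
function guessByLeadingDigit(string $line): string
{
    $line = trim($line);
    if ($line === '') {
        return 'blank';
    }
    // /^\d/ matches when the first character is an ASCII digit.
    return preg_match('/^\d/', $line) === 1 ? 'address' : 'name';
}
```

For example, guessByLeadingDigit('12 Main Street') gives 'address', while guessByLeadingDigit('21st Widget') also gives 'address', which is the known weak spot.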
I'm assuming the "STATUS" of the company is some code that could also be easily identifiable?
So you need to be flexible after line 6 and attempt to build records based on the content of the following lines, attempt to detect a new name, etc.
Doesn't sound impossible, just a little iffy that it'll be 100% accurate.
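As a rough starting point for the "fixed line 6, flexible after that" idea, something like the following might do. It is only a sketch: splitDocument() and the array keys are names I made up, and it does no classification at all. It just pulls out the document number and keeps every non-blank line after it, with its original line number, so later checks (and the visual review) can refer back to the file.

```php
<?php
// Hypothetical helper: split one document into its fixed part and the rest.
// $lines is the file as an array of lines, first line at index 0.
function splitDocument(array $lines): array
{
    $record = [
        // Line 6 (index 5) is the only position described as fixed.
        'doc_number' => isset($lines[5]) ? trim($lines[5]) : null,
        'candidates' => [],
    ];

    // Everything after line 6 is kept for later classification.
    $count = count($lines);
    for ($i = 6; $i < $count; $i++) {
        $text = trim($lines[$i]);
        if ($text === '') {
            continue; // a blank 7th line (or any other blank line) is skipped
        }
        // Store the 1-based line number so a reviewer can find it in the file.
        $record['candidates'][] = ['line' => $i + 1, 'text' => $text];
    }

    return $record;
}
```

You could feed it with file($path, FILE_IGNORE_NEW_LINES), where $path is whatever file you are on in your loop. I would leave FILE_SKIP_EMPTY_LINES off, because dropping blank lines would shift the line numbers and line 6 would no longer be line 6.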
I would probably be inclined to check for known abbreviations, address pieces and 'street types' on address lines, e.g. st, street, way, pl, blvd, ave, avenue, ste, suite, box, etc.
The reason is that, although there are not too many of them, business names do sometimes start with numbers. Once you have identified the addresses, you could use a numeric match on the remaining lines to identify business names. (You could probably also check for 'NUM' + 'one of the preceding abbreviations' to get a more accurate address match.)
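A hedged sketch of the 'NUM' + street-type check in PHP could look like this. The word list is just the examples from this post, and looksLikeAddress() is a made-up name; you would want to extend the list after looking at your real data.

```php
<?php
// Assumed keyword list, copied from the examples in the post above.
const STREET_WORDS = ['st', 'street', 'way', 'pl', 'blvd', 'ave', 'avenue', 'ste', 'suite', 'box'];

// Hypothetical helper: a line counts as an address only if it contains
// BOTH a standalone number and one of the street words as a whole word.
function looksLikeAddress(string $line): bool
{
    // \b\d+\b matches "12" but not the "21" in "21st".
    $hasNumber = preg_match('/\b\d+\b/', $line) === 1;

    // Whole-word, case-insensitive match, so "Star" does not count as "st".
    $pattern = '/\b(?:' . implode('|', STREET_WORDS) . ')\b/i';
    $hasStreetWord = preg_match($pattern, $line) === 1;

    return $hasNumber && $hasStreetWord;
}
```

Requiring both parts is what keeps "21st Widget" and "5 Star Plumbing" out of the address pile, while "12 Main St." and "PO Box 44" still get in.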
I would probably also check for Inc., LLC, etc. as known business type abbreviations.
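For the business-type check, a small sketch might be the following. Only Inc. and LLC come from this thread; ltd, corp and co are my own guesses at what "etc." might cover, and looksLikeBusiness() is a made-up name.

```php
<?php
// Hypothetical helper: does the line END with a business-type abbreviation?
// Assumption: the suffix list (ltd, corp, co are additions, not from the thread).
function looksLikeBusiness(string $line): bool
{
    // Whole word at the end of the line, optional trailing dot, any case.
    return preg_match('/\b(?:inc|llc|ltd|corp|co)\b\.?\s*$/i', trim($line)) === 1;
}
```

Anchoring the match to the end of the line and to a whole word is a deliberate choice, so a personal name like "Lincoln" is not flagged just because it contains "inc".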
IMO you will probably have to process the files multiple times, working from the most certain matches to the least certain. This might not be the exact order, but as an example: find the status codes first and process those files; then process the files with known addresses, using some variation of the preceding checks; then check for numbers within a 'name line', which tells you with a fair degree of accuracy that you have the name of a business, and process those. Anyway, this should hopefully give you a start and eliminate some of the possibilities when trying to tell a personal name from a business name.
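The multi-pass idea could be sketched roughly like this. Everything in it is an assumption for illustration: the function name, the pass order, and especially the status codes (ACTIVE, INACTIVE, DISSOLVED), since the thread never says what a status actually looks like in these files.

```php
<?php
// Hypothetical multi-pass classifier: each pass only looks at lines that
// earlier, more certain passes left unresolved.
function classifyInPasses(array $lines): array
{
    // Ordered from most certain to least certain. All patterns are guesses.
    $passes = [
        'status'   => '/^(?:ACTIVE|INACTIVE|DISSOLVED)$/i',
        'address'  => '/\b(?:st|street|way|pl|blvd|ave|avenue|ste|suite|box)\b/i',
        'business' => '/\d|\b(?:inc|llc)\b/i', // any digit left over, or a business suffix
    ];

    // Start with every line unresolved, keyed like the input array.
    $labels = array_fill_keys(array_keys($lines), 'unresolved');

    foreach ($passes as $label => $pattern) {
        foreach ($lines as $i => $line) {
            if ($labels[$i] === 'unresolved' && preg_match($pattern, trim($line)) === 1) {
                $labels[$i] = $label;
            }
        }
    }

    return $labels;
}
```

Whatever is still 'unresolved' at the end is your pool of possible personal names, and also the pile most worth checking by eye.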